In: Kosko, B., NEURAL NETWORKS AND FUZZY SYSTEMS, Prentice-Hall, 1990 

N91-21779 

CHAPTER 17 


FUZZY ASSOCIATIVE MEMORIES 


Fuzzy Systems as Between-Cube Mappings 

In Chapter 16, we introduced continuous or fuzzy sets as points in the unit hypercube 
$I^n = [0, 1]^n$. Within the cube we were interested in the distance between points. This led 
to measures of the size and fuzziness of a fuzzy set and, more fundamentally, to a measure 
of how much one fuzzy set is a subset of another fuzzy set. This within-cube theory directly 
extends to the continuous case where the space X is a subset of $R^n$ or, in general, where 
X is a subset of products of real or complex spaces. 

The next step is to consider mappings between fuzzy cubes. This level of abstraction 
provides a surprising and fruitful alternative to the propositional and predicate-calculus 
reasoning techniques used in artificial-intelligence (AI) expert systems. It allows us to 
reason with sets instead of propositions. 

The fuzzy set framework is numerical and multidimensional. The AI framework is 
symbolic and one-dimensional, with usually only bivalent expert “rules” or propositions 
allowed. Both frameworks can encode structured knowledge in linguistic form. But the 
fuzzy approach translates the structured knowledge into a flexible numerical framework 
and processes it in a manner that resembles neural network processing. The numerical 
framework also allows fuzzy systems to be adaptively inferred and modified, perhaps with 
neural or statistical techniques, directly from problem domain sample data. 

Between-cube theory is fuzzy systems theory. A fuzzy set is a point in a cube. A 
fuzzy system is a mapping between cubes. A fuzzy system S maps fuzzy sets to fuzzy 
sets. Thus a fuzzy system S is a transformation $S: I^n \to I^p$. The n-dimensional 
unit hypercube $I^n$ houses all the fuzzy subsets of the domain space, or input universe of 
discourse, $X = \{x_1, \ldots, x_n\}$. $I^p$ houses all the fuzzy subsets of the range space, or output 
universe of discourse, $Y = \{y_1, \ldots, y_p\}$. X and Y can also be subsets of $R^n$ and $R^p$. Then 
the fuzzy power sets $F(2^X)$ and $F(2^Y)$ replace $I^n$ and $I^p$. 

In general a fuzzy system S maps families of fuzzy sets to families of fuzzy sets, thus 
$S: I^{n_1} \times \cdots \times I^{n_r} \to I^{p_1} \times \cdots \times I^{p_s}$. Here too we can extend the definition of a 
fuzzy system to allow arbitrary products of arbitrary mathematical spaces to serve as the 
domain or range spaces of the fuzzy sets. 

(A technical comment is in order for sake of historical clarification. A tenet, perhaps 
the defining tenet, of the classical theory [Dubois, 1980] of fuzzy sets as functions concerns 
the fuzzy extension of any mathematical function. This tenet holds that any function 
f : X — > Y that maps points in X to points in Y can be extended to map the fuzzy 
subsets of X to the fuzzy subsets of Y. The so-called extension principle is used to define 
the set-function $f: F(2^X) \to F(2^Y)$, where $F(2^X)$ is the fuzzy power set of X, the set 
of all fuzzy subsets of X. The formal definition of the extension principle is complicated. 
The key idea is a supremum of pairwise minima. Unfortunately, the extension principle 
achieves generality at the price of triviality. One can show [Kosko, 1986a-87] that in general 
the extension principle extends functions to fuzzy sets by stripping the fuzzy sets of their 
fuzziness, mapping the fuzzy sets into bit vectors of nearly all 1s. This shortcoming, 
combined with the tendency of the extension-principle framework to push fuzzy theory 
into largely inaccessible regions of abstract mathematics, led in part to the development 
of the alternative sets-as-points geometric framework of fuzzy theory.) 

We shall focus on fuzzy systems $S: I^n \to I^p$ that map balls of fuzzy sets in $I^n$ to 
balls of fuzzy sets in $I^p$. These continuous fuzzy systems behave as associative memories. 
They map close inputs to close outputs. We shall refer to them as fuzzy associative 
memories, or FAMs. 

The simplest FAM encodes the FAM rule or association $(A_i, B_i)$, which associates 
the p-dimensional fuzzy set $B_i$ with the n-dimensional fuzzy set $A_i$. These minimal FAMs 
essentially map one ball in $I^n$ to one ball in $I^p$. They are comparable to simple neural 
networks. But the minimal FAMs need not be adaptively trained. As discussed below, 
structured knowledge of the form “If traffic is heavy in this direction, then keep the stop 
light green longer” can be directly encoded in a Hebbian-style FAM matrix. In practice 
we can eliminate even this matrix. In its place the user encodes the fuzzy-set association 
(HEAVY, LONGER) as a single linguistic entry in a FAM bank matrix. 

In general a FAM system $F: I^n \to I^p$ encodes and processes in parallel a FAM 
bank of m FAM rules $(A_1, B_1), \ldots, (A_m, B_m)$. Each input A to the FAM system activates 
each stored FAM rule to a different degree. The minimal FAM that stores $(A_i, B_i)$ maps 
input A to $B_i'$, a partially activated version of $B_i$. The more A resembles $A_i$, the more $B_i'$ 
resembles $B_i$. The corresponding output fuzzy set B combines these partially activated 
fuzzy sets $B_1', \ldots, B_m'$. In the simplest case B is a weighted average of the partially activated 
sets: 

$$B = w_1 B_1' + \cdots + w_m B_m',$$

where $w_i$ reflects the credibility, frequency, or strength of the fuzzy association $(A_i, B_i)$. In 
practice we usually “defuzzify” the output waveform B to a single numerical value $y_j$ in Y 
by computing the fuzzy centroid of B with respect to the output universe of discourse Y. 
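
The combination and defuzzification steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `combine` and `centroid`, the three-point output universe, the two partially activated sets, and the equal weights are all hypothetical, and the centroid uses the common discrete form (fit-weighted mean of the output values), which the text has not yet defined formally.

```python
def combine(weights, activated_sets):
    """Pointwise weighted sum B = w_1*B'_1 + ... + w_m*B'_m of fit vectors."""
    p = len(activated_sets[0])
    return [
        sum(w * b_prime[j] for w, b_prime in zip(weights, activated_sets))
        for j in range(p)
    ]


def centroid(y_values, fits):
    """Discrete fuzzy centroid: sum_j y_j * b_j / sum_j b_j (assumed form)."""
    total = sum(fits)
    if total == 0:
        raise ValueError("all fit values are zero; the centroid is undefined")
    return sum(y * b for y, b in zip(y_values, fits)) / total


# Hypothetical output universe: green-light durations in seconds.
y_values = [0.0, 5.0, 10.0]
# Hypothetical partially activated output sets B'_1 and B'_2.
activated = [[0.0, 0.5, 1.0], [0.5, 0.5, 0.0]]
weights = [1.0, 1.0]  # equal rule weights, an assumption

B = combine(weights, activated)
y_out = centroid(y_values, B)
print(B, y_out)
```

Note that the summed waveform need not stay inside the unit cube; the centroid only depends on its shape, so this does not affect the defuzzified value.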

More general still, a FAM system encodes a bank of compound FAM rules that associate 
multiple output or consequent fuzzy sets $B_i^1, \ldots, B_i^s$ with multiple input or antecedent fuzzy 
sets $A_i^1, \ldots, A_i^r$. We can treat compound FAM rules as compound linguistic conditionals. 
Structured knowledge can then be naturally, and in many cases easily, obtained. We 
combine antecedent and consequent sets with logical conjunction, disjunction, or negation. 
For instance, we would interpret the compound association $(A^1, A^2; B)$ linguistically as 
the compound conditional “IF $X^1$ is $A^1$ AND $X^2$ is $A^2$, THEN Y is B” if the comma in 
the fuzzy association $(A^1, A^2; B)$ stood for conjunction instead of, say, disjunction. 

We specify in advance the numerical universes of discourse $X^1$, $X^2$, and Y. For each 
universe of discourse X, we specify an appropriate library of fuzzy set values, $A_1, \ldots, A_k$. 
Contiguous fuzzy sets in a library overlap. In principle a neural network can estimate these 
libraries of fuzzy sets. In practice this is usually unnecessary. The library sets represent 
a weighted, though overlapping, quantization of the input space X. A different library of 
fuzzy sets similarly quantizes the output space Y. Once the library of fuzzy sets is defined, 
we construct the FAM by choosing appropriate combinations of input and output fuzzy 
sets. We can use adaptive techniques to make, assist, or modify these choices. 

An adaptive FAM (AFAM) is a time-varying FAM system. System parameters gradually 
change as the FAM system samples and processes data. Below we discuss how neural 
network algorithms can adaptively infer FAM rules from training data. In principle learning 
can modify other FAM system components, such as the libraries of fuzzy sets or the 
FAM-rule weights $w_i$. 

Below we propose and illustrate an unsupervised adaptive clustering scheme, based on 
competitive learning, for “blindly” generating and refining the bank of FAM rules. In some 
cases we can use supervised learning techniques, though we need additional information 
to accurately generate error estimates. 


FUZZY AND NEURAL FUNCTION ESTIMATORS 

Neural and fuzzy systems estimate sampled functions and behave as associative mem- 
ories. They share a key advantage over traditional statistical-estimation and adaptive- 
control approaches to function estimation. They are model-free estimators. Neural and 
fuzzy systems estimate a function without requiring a mathematical description of how the 
output functionally depends on the input. They “learn from example.” More precisely, 
they learn from samples. 

Both approaches are numerical, can be partially described with theorems, and admit an 
algorithmic characterization that favors silicon and optical implementation. These prop- 
erties distinguish neural and fuzzy approaches from the symbolic processing approaches of 
artificial intelligence. 

Neural and fuzzy systems differ in how they estimate sampled functions. They differ 
in the kind of samples used, how they represent and store those samples, and how they 
associatively “inference” or map inputs to outputs. 

These differences appear during system construction. The neural approach requires 
the specification of a nonlinear dynamical system, usually feedforward, the acquisition of 
a sufficiently representative set of numerical training samples, and the encoding of those 
training samples in the dynamical system by repeated learning cycles. The fuzzy system 
requires only that a linguistic “rule matrix” be partially filled in. This task is markedly 
simpler than designing and training a neural network. Once we construct the systems, we 
can present the same numerical inputs to either system. The outputs will be in the same 
numerical space of alternatives. So both systems correspond to a surface or manifold in 
the input-output product space $X \times Y$. We present examples of these surfaces in Chapters 
18 and 19. 

Which system, neural or fuzzy, is more appropriate for a particular problem depends on 
the nature of the problem and the availability of numerical and structured data. To date 
fuzzy techniques have been most successfully applied to control problems. These problems 
often permit comparison with standard control-theoretic and expert-system approaches. 
Neural networks so far seem best applied to ill-defined two-class pattern recognition prob- 
lems (defective or nondefective, bomb or not, etc.). The application of both approaches to 
new problem areas is just beginning, amid varying amounts of enthusiasm and scepticism. 

Fuzzy systems estimate functions with fuzzy set samples $(A_i, B_i)$. Neural systems use 
numerical point samples $(x_i, y_i)$. Both kinds of samples are from the input-output product 
space $X \times Y$. Figure 17.1 illustrates the geometry of fuzzy-set and numerical-point samples 
taken from the function $f: X \to Y$. 

The fuzzy-set association $(A_i, B_i)$ is sometimes called a “rule.” This is misleading 
since reasoning with sets is not the same as reasoning with propositions. Reasoning with 
sets is harder. Sets are multidimensional, and associations are housed in matrices, not 
conditionals. We must take care how we define each term and operation. We shall refer to 
the antecedent term $A_i$ in the fuzzy association $(A_i, B_i)$ as the input associant and the 
consequent term $B_i$ as the output associant. 




FIGURE 17.1 Function f maps domain X to range Y. In the first illustration 
we use several numerical point samples $(x_i, y_i)$ to estimate $f: X \to Y$. 
In the second case we use only a few fuzzy subsets $A_i$ of X and $B_i$ of Y. The 
fuzzy association $(A_i, B_i)$ represents system structure, as an adaptive clustering 
algorithm might infer or as an expert might articulate. In practice there are 
usually fewer different output associants or “rule” consequents $B_i$ than input 
associants or antecedents $A_i$. 


The fuzzy-set sample $(A_i, B_i)$ encodes structure. It represents a mapping itself, a minimal 
fuzzy association of part of the output space with part of the input space. In practice 
this resembles a meta-rule (IF $A_i$, THEN $B_i$), the type of structured linguistic rule an expert 
might articulate to build an expert-system “knowledge base.” The association might 
also be the result of an adaptive clustering algorithm. 

Consider a fuzzy association that might be used in the intelligent control of a traffic 
light: “If the traffic is heavy in this direction, then keep the light green longer.” The 
fuzzy association is (HEAVY, LONGER). Another fuzzy association might be (LIGHT, 
SHORTER). The fuzzy system encodes each linguistic association or “rule” in a numerical 
fuzzy associative memory (FAM) mapping. The FAM then numerically processes numerical 
input data. A measured description of traffic density (e.g., 150 cars per unit road surface 
area) then corresponds to a unique numerical output (e.g., 3 seconds), the “recalled” 
output. 

The degree to which a particular measurement of traffic density is heavy depends on 
how we define the fuzzy set of heavy traffic. The definition may be obtained from statistical 
or neural clustering of historical data or from pooling the responses of experts. In practice 
the fuzzy engineer and the problem domain expert agree on one of many possible libraries 
of fuzzy set definitions for the variables in question. 

The degree to which the traffic light is kept green longer depends on the degree to 
which the measurement is heavy. In the simplest case the two degrees are the same. In 
general they differ. In actual fuzzy systems the output control variables (in this case the 
single variable green-light duration) depend on many FAM rule antecedents or associants 
that are activated to different degrees by incoming data. 

Neural vs. Fuzzy Representation of Structured Knowledge 


The functional difference between fuzzy and neural systems begins with 
how they represent structured knowledge. How would a neural network encode the same 
associative information? How would a neural network encode the structured knowledge 
“If the traffic is heavy in this direction, then keep the light green longer”? 

The simplest method is to encode two associated numerical vectors. One vector rep- 
resents the input associant HEAVY. The other vector represents the output associant 
LONGER. But this is too simple. For the neural network’s fault tolerance now works 
to its disadvantage. The network tends to reconstruct partial inputs to complete sample 
inputs. It erases the desired partial degrees of activation. If an input is close to $A_i$, the 
output will tend to be $B_i$. If the input is distant from $A_i$, the output will tend to be some 
other sampled output vector or a spurious output altogether. 

A better neural approach is to encode a mapping from the heavy-traffic subspace to 
the longer-time subspace. Then the neural network needs a representative sample set to 
capture this structure. Statistical networks, such as adaptive vector quantizers, may need 
thousands of statistically representative samples. Feedforward multi-layer neural networks 
trained with the backpropagation algorithm may need hundreds of representative numerical 
input-output pairs and may need to recycle these samples tens of thousands of times in 
the learning process. 

The neural approach suffers a deeper problem than just the computational burden of 
training. What does it encode? How do we know the network encodes the original struc- 
ture? What does it recall? There is no natural inferential audit trail. System nonlinearities 
wash it away. Unlike an expert system, we do not know which inferential paths the network 
uses to reach a given output or even which inferential paths exist. There is only a system of 
synchronous or asynchronous nonlinear functions. Unlike, say, the adaptive Kalman filter, 
we cannot appeal to a postulated mathematical model of how the output state depends on 
the input state. Model-free estimation is, after all, the central computational advantage 
of neural networks. The cost is system inscrutability. 

We are left with an unstructured computational black box. We do not know what the 
neural network encoded during training or what it will encode or forget in further training. 
(For competitive adaptive vector quantizers we do know that sample-space centroids are 
asymptotically estimated.) We can characterize the neural network’s behavior only by 
exhaustively passing all inputs through the black box and recording the recalled outputs. 
The characterization may be in terms of a summary scalar like mean-squared error. 

This black-box characterization of the network’s behavior involves a computational 
dilemma. On the one hand, for most problems the number of input-output cases we need 
to check is computationally prohibitive. On the other, when the number of input-output 
cases is tractable, we may as well store these pairs and appeal to them directly, and without 
error, as a look-up table. In the first case the neural network is unreliable. In the second 
case it is unnecessary. 

A further problem is sample generation. Where did the original numerical point samples 
come from? Was an expert asked to give numbers? How reliable are such numerical vectors, 
especially when the expert feels most comfortable giving the original linguistic data? This 
procedure seems at most as reliable as the expert-system method of asking an expert to 
give condition-action rules with numerical uncertainty weights. 

Statistical neural estimators require a “statistically representative” sample set. We may 
need to randomly “create” these samples from an initial small sample set by bootstrap tech- 
niques or by random-number generation of points clustered near the original samples. Both 
sample-augmentation procedures assume that the initial sample set sufficiently represents 
the underlying probability distribution. The problem of where the original sample set 
comes from remains. The fuzziness of the notion “statistically representative” compounds 
the problem. In general we do not know in advance how well a given sample set reflects an 
unknown underlying distribution of points. Indeed when the network is adapting on-line, 
we know only past samples. The remainder of the sample set is in the unsampled future. 

In contrast, fuzzy systems directly encode the linguistic sample (HEAVY, LONGER) in 
a dedicated numerical matrix. The default encoding technique is the fuzzy Hebb procedure 
discussed below. For practical problems, as mentioned above, the numerical matrix need 
not be stored. Indeed it need not even be formed. Certain numerical inputs permit this 
simplification, as we shall see below. In general we describe inputs by an uncertainty 
distribution, probabilistic or fuzzy. Then we must use the entire matrix. 

For instance, if a heavy traffic input is simply the number 150, we can omit the FAM 
matrix. But if the input is a Gaussian curve with mean 150, then in principle we must 
process the vector input with a FAM matrix. (In practice we might use only the mean.) 
This difference is explained below. The dimensions of the linguistic FAM bank matrix 
are usually small. The dimensions reflect the quantization levels of the input and output 
spaces. 

The fuzzy approach combines the purely numerical approaches of neural networks and 
mathematical modeling with the symbolic, structure-rich approaches of artificial intelli- 
gence. We acquire knowledge symbolically — or numerically if we use adaptive techniques 
— but represent it numerically. We also process data numerically. Adaptive FAM rules 
correspond to common-sense, often non-articulated, behavioral rules that improve with 
experience. 

We can acquire structured expertise in the fuzzy terminology of the knowledge source, 
the “expert.” This requires little or no force-fitting. Such is the expressive power of 
fuzziness. Yet in the numerical domain we can prove theorems and design hardware. 

This approach does not abandon neural network techniques. Instead, it limits them to 
unstructured parameter and state estimation, pattern recognition, and cluster formation. 
The system architecture remains fuzzy, though perhaps adaptively so. In the same spirit, 
no one believes that the brain is a single unstructured neural network. 


FAMs as Mappings 

Fuzzy associative memories (FAMs) are transformations. FAMs map fuzzy sets 
to fuzzy sets. They map unit cubes to unit cubes. This is evident in Figure 17.1. In 
the simplest case the FAM consists of a single association, such as (HEAVY, LONGER). 
In general the FAM consists of a bank of different FAM associations. Each association 
is represented by a different numerical FAM matrix, or a different entry in a FAM-bank 
matrix. These matrices are not combined as with neural network associative memory 
(outer-product) matrices. (An exception is the fuzzy cognitive map [Kosko, 1988; Taber, 
1987, 1990].) The matrices are stored separately but accessed in parallel. 

We begin with single-association FAMs. For concreteness let the fuzzy-set pair (A, B) 
encode the traffic-control association (HEAVY, LONGER). We quantize the domain of traffic 
density to the n numerical variables $x_1, x_2, \ldots, x_n$. We quantize the range of green-light 
duration to the p variables $y_1, y_2, \ldots, y_p$. The elements $x_i$ and $y_j$ belong respectively to 
the ground sets $X = \{x_1, \ldots, x_n\}$ and $Y = \{y_1, \ldots, y_p\}$. $x_1$ might represent zero 
traffic density. $y_p$ might represent 10 seconds. 

The fuzzy sets A and B are fuzzy subsets of X and Y. So A is a point in the n-dimensional 
unit hypercube $I^n = [0, 1]^n$, and B is a point in the p-dimensional fuzzy 
cube $I^p$. Equivalently, we can think of A and B as membership functions $m_A$ and $m_B$ 
mapping the elements $x_i$ of X and $y_j$ of Y to degrees of membership in [0, 1]. The 
membership values, or fit (fuzzy unit) values, indicate how much $x_i$ belongs to or fits in 
subset A, and how much $y_j$ belongs to B. We describe this with the abstract functions 
$m_A: X \to [0, 1]$ and $m_B: Y \to [0, 1]$. We shall freely view sets both as functions 
and as points. 

The geometric sets-as-points interpretation of fuzzy sets A and B as points in unit 
cubes allows a natural vector representation. We represent A and B by the numerical fit 
vectors $A = (a_1, \ldots, a_n)$ and $B = (b_1, \ldots, b_p)$, where $a_i = m_A(x_i)$ and $b_j = m_B(y_j)$. 
We can interpret the identifications A = HEAVY and B = LONGER to suit the problem 
at hand. Intuitively the $a_i$ values should increase as the index i increases, perhaps approximating 
a sigmoid membership function. Figure 17.2 illustrates three possible fuzzy 
subsets of the universe of discourse X. 

FIGURE 17.2 Three possible fuzzy subsets of traffic density space X (axis label: 
TRAFFIC DENSITY). Each fuzzy sample corresponds to such a subset. We draw the 
fuzzy sets as continuous membership functions. In practice membership values 
are quantized. So the sets are points in the unit hypercube $I^n$. 

Fuzzy Vector-Matrix Multiplication: Max-Min Composition 

Fuzzy vector-matrix multiplication is similar to classical vector-matrix multiplication. 
We replace pairwise multiplications with pairwise minima. We replace column (row) sums 
with column (row) maxima. We denote this fuzzy vector-matrix composition relation, 
or the max-min composition relation [Klir, 1988], by the composition operator “o”. For 
row fit vectors A and B and fuzzy n-by-p matrix M (a point in $I^{n \times p}$): 

$$A \circ M = B, \tag{1}$$

where we compute the “recalled” component $b_j$ by taking the fuzzy inner product of fit 
vector A with the jth column of M: 

$$b_j = \max_{1 \le i \le n} \min(a_i, m_{ij}). \tag{2}$$

Suppose we compose the fit vector A = (.3 .4 .8 1) with the fuzzy matrix M given by 

$$M = \begin{pmatrix} .2 & .8 & .7 \\ .7 & .6 & .6 \\ .8 & .1 & .5 \\ 0 & .2 & .3 \end{pmatrix}.$$

Then we compute the “recalled” fit vector $B = A \circ M$ component-wise as 

$$b_1 = \max\{\min(.3, .2), \min(.4, .7), \min(.8, .8), \min(1, 0)\} = \max(.2, .4, .8, 0) = .8,$$
$$b_2 = \max(.3, .4, .1, .2) = .4,$$
$$b_3 = \max(.3, .4, .5, .3) = .5.$$

So B = (.8 .4 .5). If we somehow encoded (A, B) in the FAM matrix M, we would say 
that the FAM system exhibits perfect recall in the forward direction. 
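
The worked example can be checked with a short Python sketch of equation (2). The function name `max_min` and the list-of-rows matrix layout are illustrative choices, not notation from the text.

```python
def max_min(a, m):
    """Max-min composition a o m: b_j = max_i min(a_i, m_ij).

    `a` is a fit vector of length n; `m` is an n-by-p matrix given as a
    list of n rows. Returns the recalled fit vector of length p.
    """
    n, p = len(m), len(m[0])
    if len(a) != n:
        raise ValueError("vector length must equal the number of matrix rows")
    return [max(min(a[i], m[i][j]) for i in range(n)) for j in range(p)]


A = [0.3, 0.4, 0.8, 1.0]
M = [
    [0.2, 0.8, 0.7],
    [0.7, 0.6, 0.6],
    [0.8, 0.1, 0.5],
    [0.0, 0.2, 0.3],
]

B = max_min(A, M)
print(B)  # matches the hand computation (.8 .4 .5)
```

Because only `min` and `max` are applied, every recalled fit value is one of the input or matrix values, so no rounding error enters the result.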

The neural interpretation of max-min composition is that each neuron in field $F_Y$ 
(or field $F_B$) generates its signal/activation value by fuzzy linear composition. Passing 
information back through $M^T$ allows us to interpret the fuzzy system as a bidirectional associative 
memory (BAM). The Bidirectional FAM Theorems below characterize successful 
BAM recall for fuzzy correlation or Hebbian learning. 

For completeness we also mention the max-product composition operator, which 
replaces minimum with product in (2): 

$$b_j = \max_{1 \le i \le n} a_i \, m_{ij}.$$

In the fuzzy literature this composition operator is often confused with the fuzzy correlation 
encoding scheme discussed below. Max-product composition is a method for “multiplying” 
fuzzy matrices or vectors. Fuzzy correlation, which also uses pairwise products of 
fit values, is a method for constructing fuzzy matrices. In practice, and in the following 
discussion, we use only max-min composition. 


FUZZY HEBB FAMs 

Most fuzzy systems found in applications are fuzzy Hebb FAMs [Kosko, 1986b]. They 
are fuzzy systems $S: I^n \to I^p$ constructed in a simple neural-like manner. As discussed 
in Chapter 4, in neural network theory we interpret the classical Hebbian hypothesis of 
correlation synaptic learning [Hebb, 1949] as unsupervised learning with the signal product 
$S_i S_j$: 

$$\dot{m}_{ij} = -m_{ij} + S_i(x_i) \, S_j(y_j). \tag{3}$$

For a given pair of bipolar vectors (X, Y), the neural interpretation gives the outer-product 
correlation matrix 

$$M = X^T Y. \tag{4}$$

The fuzzy Hebb matrix is similarly defined pointwise by the minimum of the “signals” 
$a_i$ and $b_j$, an encoding scheme we shall call correlation-minimum encoding: 

$$m_{ij} = \min(a_i, b_j), \tag{5}$$

given in matrix notation as the fuzzy outer-product 

$$M = A^T \circ B. \tag{6}$$

Mamdani [1977] and Togai [1986] independently arrived at the fuzzy Hebbian prescription 
(5) as a multi-valued logical-implication operator: $\text{truth}(a_i \to b_j) = \min(a_i, b_j)$. 

The min operator, though, is a symmetric truth operator. So it does not properly generalize 
the classical implication $P \to Q$, which is false if and only if the antecedent P 
is true and the consequent Q is false, $t(P) = 1$ and $t(Q) = 0$. In contrast, a like desire 
to define a “conditional possibility” matrix pointwise with continuous implication values 
led Zadeh [1983] to choose the Lukasiewicz implication operator: 
$m_{ij} = \text{truth}(a_i \to b_j) = \min(1, 1 - a_i + b_j)$. The problem with the Lukasiewicz operator is that it is usually 
unity. For $\min(1, 1 - a_i + b_j) < 1$ iff $a_i > b_j$. Most entries of the resulting matrix M 
are unity or near unity. This ignores the information in the association (A, B). So $A' \circ M$ 
tends to equal the largest fit value $a_k'$ for any system input $A'$. 
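
A small Python sketch makes the contrast between the two implication operators concrete. It reuses the fit vectors A = (.3 .4 .8 1) and B = (.8 .4 .5) from the earlier example; the helper names, the test input `A_prime`, and the rounding tolerance used to count unity entries are assumptions made for illustration.

```python
def max_min(a, m):
    """Max-min composition: b_j = max_i min(a_i, m_ij)."""
    return [max(min(a[i], m[i][j]) for i in range(len(m)))
            for j in range(len(m[0]))]


def implication_matrix(a, b, op):
    """Build m_ij = op(a_i, b_j) pointwise."""
    return [[op(ai, bj) for bj in b] for ai in a]


def lukasiewicz(ai, bj):
    """Lukasiewicz implication value min(1, 1 - a_i + b_j)."""
    return min(1.0, 1.0 - ai + bj)


A = [0.3, 0.4, 0.8, 1.0]
B = [0.8, 0.4, 0.5]

M_min = implication_matrix(A, B, min)          # correlation-minimum matrix
M_luk = implication_matrix(A, B, lukasiewicz)  # Lukasiewicz matrix

# Count entries that equal 1 up to floating-point rounding.
TOL = 1e-9
unity_count = sum(1 for row in M_luk for v in row if v >= 1.0 - TOL)

# Hypothetical input that resembles neither A nor B.
A_prime = [0.5, 0.2, 0.0, 0.0]
recalled_luk = max_min(A_prime, M_luk)
recalled_min = max_min(A_prime, M_min)
print(unity_count, recalled_luk, recalled_min)
```

In this example 7 of the 12 Lukasiewicz entries are unity, and the recalled vector is flat at the largest input fit value .5, while the correlation-minimum matrix still returns a vector shaped by the stored rows.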

We construct an autoassociative fuzzy Hebb FAM matrix by encoding the redundant 
pair (A, A) in (6), as the fuzzy auto-correlation matrix: 

$$M = A^T \circ A. \tag{7}$$

In the previous example the matrix M was such that the input A = (.3 .4 .8 1) 
recalled fit vector B = (.8 .4 .5) upon max-min composition: $A \circ M = B$. Will 
B still be recalled if we replace the original matrix M with the fuzzy Hebb matrix found 
with (6)? Substituting A and B in (6) gives 

$$M = A^T \circ B = \begin{pmatrix} .3 \\ .4 \\ .8 \\ 1 \end{pmatrix} \circ \begin{pmatrix} .8 & .4 & .5 \end{pmatrix} = \begin{pmatrix} .3 & .3 & .3 \\ .4 & .4 & .4 \\ .8 & .4 & .5 \\ .8 & .4 & .5 \end{pmatrix}.$$

This fuzzy Hebb matrix M illustrates two key properties. First, the ith row of M is 
the pairwise minimum of $a_i$ and the output associant B. Symmetrically, the jth column 
of M is the pairwise minimum of $b_j$ and the input associant A: 

$$M = \begin{pmatrix} a_1 \wedge B \\ \vdots \\ a_n \wedge B \end{pmatrix} \tag{8}$$

$$\;\; = \left[\, b_1 \wedge A^T \mid \cdots \mid b_p \wedge A^T \,\right], \tag{9}$$

where the cap operator denotes pairwise minimum: $a_i \wedge b_j = \min(a_i, b_j)$. The term 
$a_i \wedge B$ indicates component-wise minimum: 

$$a_i \wedge B = (a_i \wedge b_1, \ldots, a_i \wedge b_p). \tag{10}$$

Hence if some $a_k = 1$, then the kth row of M is B. If some $b_l = 1$, the lth column of 
M is A. More generally, if some $a_k$ is at least as large as every $b_j$, then the kth row of the 
fuzzy Hebb matrix M is B. 

Second, the third and fourth rows of M are just the fit vector B. Yet no column 
is A. This allows perfect recall in the forward direction, $A \circ M = B$, but not in the 
backward direction, $B \circ M^T \neq A$: 

$$A \circ M = (.8 \;\; .4 \;\; .5) = B,$$
$$B \circ M^T = (.3 \;\; .4 \;\; .8 \;\; .8) = A' \subset A.$$

$A'$ is a proper subset of A: $A' \neq A$ and $S(A', A) = 1$, where S measures the degree of 
subsethood of $A'$ in A, as discussed in Chapter 16. In other words, $a_i' \le a_i$ for each i and 
$a_k' < a_k$ for at least one k. The Bidirectional FAM Theorems below show that this is a 
general property: If $B' = A \circ M$ differs from B, then $B'$ is a proper subset of B. Hence 
fuzzy subsets truly map to fuzzy subsets. 
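
The forward and backward recall just described can be reproduced with a brief Python sketch of correlation-minimum encoding (5) followed by max-min composition (2). The function names `hebb_min`, `transpose`, and `max_min` are hypothetical helpers introduced here.

```python
def hebb_min(a, b):
    """Correlation-minimum encoding: m_ij = min(a_i, b_j)."""
    return [[min(ai, bj) for bj in b] for ai in a]


def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def max_min(a, m):
    """Max-min composition: b_j = max_i min(a_i, m_ij)."""
    return [max(min(a[i], m[i][j]) for i in range(len(m)))
            for j in range(len(m[0]))]


A = [0.3, 0.4, 0.8, 1.0]
B = [0.8, 0.4, 0.5]

M = hebb_min(A, B)
forward = max_min(A, M)              # A o M
backward = max_min(B, transpose(M))  # B o M^T

# A' = B o M^T is a subset of A: no fit value exceeds the stored one.
is_subset = all(ap <= a for ap, a in zip(backward, A))
print(forward, backward, is_subset)
```

Here the forward pass returns B exactly, while the backward pass clips the last fit value of A from 1 to .8, the height of B.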

The Bidirectional FAM Theorem for Correlation-Minimum Encoding 


Analysis of FAM recall uses the traditional [Klir, 1988] fuzzy set notions of the height 
and the normality of fuzzy sets. The height H(A) of fuzzy set A is the maximum fit value 
of A: 

$$H(A) = \max_{1 \le i \le n} a_i.$$

A fuzzy set is normal if H(A) = 1, that is, if at least one fit value $a_k$ is maximal: $a_k = 1$. In 
practice fuzzy sets are usually normal. We can extend a nonnormal fuzzy set to a normal 
fuzzy set by adding a dummy dimension with corresponding fit value $a_{n+1} = 1$. 

Recall accuracy in fuzzy Hebb FAMs constructed with correlation-minimum encoding 
depends on the heights H(A) and H(B). Normal fuzzy sets exhibit perfect recall. Indeed 
(A, B) is a bidirectional fixed point ($A \circ M = B$ and $B \circ M^T = A$) if and only if 
H(A) = H(B), which always holds if A and B are normal. This is the content of the 
Bidirectional FAM Theorem [Kosko, 1986a] for correlation-minimum encoding. Below we 
present a similar theorem for correlation-product encoding. 


Correlation-Minimum Bidirectional FAM Theorem. If $M = A^T \circ B$, then 

(i)   $A \circ M = B$ iff $H(A) \ge H(B)$, 
(ii)  $B \circ M^T = A$ iff $H(B) \ge H(A)$, 
(iii) $A' \circ M \subset B$ for any $A'$, 
(iv)  $B' \circ M^T \subset A$ for any $B'$. 


Proof. Observe that the height H(A) is the fuzzy norm of A: 

$$A \circ A^T = \max_i \, a_i \wedge a_i = \max_i \, a_i = H(A).$$

Then 

$$A \circ M = A \circ (A^T \circ B) = (A \circ A^T) \circ B = H(A) \circ B = H(A) \wedge B.$$

So $H(A) \wedge B = B$ iff $H(A) \ge H(B)$, establishing (i). Now suppose $A'$ is an arbitrary 
fit vector in $I^n$. Then 

$$A' \circ M = (A' \circ A^T) \circ B = (A' \circ A^T) \wedge B,$$

which establishes (iii). A similar argument using $M^T = B^T \circ A$ establishes (ii) and (iv). 
Q.E.D. 
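
The theorem can be spot-checked numerically on a pair whose heights differ. The sketch below is an illustration, not a proof: the nonnormal vector A = (.3 .4 .6) is a hypothetical example chosen so that H(A) < H(B), and the helper names are assumptions.

```python
def hebb_min(a, b):
    """Correlation-minimum encoding: m_ij = min(a_i, b_j)."""
    return [[min(ai, bj) for bj in b] for ai in a]


def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def max_min(a, m):
    """Max-min composition: b_j = max_i min(a_i, m_ij)."""
    return [max(min(a[i], m[i][j]) for i in range(len(m)))
            for j in range(len(m[0]))]


def height(a):
    """Height H(A): the largest fit value of A."""
    return max(a)


A = [0.3, 0.4, 0.6]   # hypothetical nonnormal input associant, H(A) = .6
B = [0.8, 0.4, 0.5]   # output associant from the text, H(B) = .8
M = hebb_min(A, B)

forward = max_min(A, M)               # should equal H(A) ^ B
clipped_B = [min(height(A), bj) for bj in B]
backward = max_min(B, transpose(M))   # should equal A since H(B) >= H(A)

print(forward, clipped_B, backward)
```

Since H(A) = .6 is smaller than H(B) = .8, part (i) fails and the forward recall is the clipped set (.6 .4 .5), while part (ii) holds and the backward recall returns A.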


The equality $A \circ A^T = H(A)$ implies an immediate corollary of the Bidirectional 
FAM Theorem. Supersets $A' \supset A$ behave the same as the encoded input associant 
A: $A' \circ M = B$ if $A \circ M = B$. Fuzzy Hebb FAMs ignore the information in the 
difference $A' - A$ when $A \subset A'$. 

Correlation-Product Encoding 

An alternative fuzzy Hebbian encoding scheme is correlation-product encoding. 
The standard mathematical outer product of the fit vectors A and B forms the FAM 
matrix M. This is given pointwise as 

$$m_{ij} = a_i \, b_j \tag{11}$$

and in matrix notation as 

$$M = A^T B. \tag{12}$$

So the ith row of M is just the fit-scaled fuzzy set $a_i B$, and the jth column of M is 
$b_j A^T$: 

$$M = \begin{pmatrix} a_1 B \\ \vdots \\ a_n B \end{pmatrix} \tag{13}$$

$$\;\; = \left[\, b_1 A^T \mid \cdots \mid b_p A^T \,\right]. \tag{14}$$

If $A = (.3\ \ .4\ \ .8\ \ 1)$ and $B = (.8\ \ .4\ \ .5)$ as above, we encode the FAM rule $(A, B)$ with
correlation-product in the following matrix $M$:

\[
M = \begin{pmatrix}
.24 & .12 & .15 \\
.32 & .16 & .2 \\
.64 & .32 & .4 \\
.8 & .4 & .5
\end{pmatrix}
\]

Note that if $A' = (0\ 0\ 0\ 1)$, then $A' \circ M = B$. The output associant $B$ is recalled
to maximal degree. If $A' = (1\ 0\ 0\ 0)$, then $A' \circ M = (.24\ \ .12\ \ .15)$. The output $B$ is
recalled only to degree .3.
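
The example above can be reproduced with a short Python sketch. The function names `corr_product` and `max_min` are illustrative assumptions, not notation from the text.

```python
def corr_product(a, b):
    """Correlation-product encoding M = A^T B, with m_ij = a_i * b_j."""
    return [[ai * bj for bj in b] for ai in a]


def max_min(a, m):
    """Max-min composition a o M: component j is max_i min(a_i, m_ij)."""
    return [max(min(a[i], m[i][j]) for i in range(len(a)))
            for j in range(len(m[0]))]


A = [0.3, 0.4, 0.8, 1.0]
B = [0.8, 0.4, 0.5]
M = corr_product(A, B)

# A binary input picks off one row of M, that is, one scaled copy a_i * B.
recall_last = max_min([0, 0, 0, 1], M)    # a_4 = 1, so B is recalled in full
recall_first = max_min([1, 0, 0, 0], M)   # a_1 = .3, so B is scaled by .3

print(recall_last)
print([round(x, 2) for x in recall_first])
```

Up to floating-point rounding, the two recalled vectors are $B$ itself and $.3\,B = (.24\ \ .12\ \ .15)$.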

Correlation-minimum encoding produces a matrix of clipped $B$ sets. Correlation-product
encoding produces a matrix of scaled $B$ sets. In membership function plots,
the scaled fuzzy sets $a_i B$ all have the same shape as $B$. The clipped fuzzy sets $a_i \wedge B$
are largely flat. In this sense correlation-product encoding preserves more information
than correlation-minimum encoding, an important point in fuzzy applications when output
fuzzy sets are added together as in equation (17) below. In the fuzzy-applications
literature this often leads to the selection of correlation-product encoding.



Unfortunately, in the fuzzy-applications literature the correlation-product encoding 
scheme is invariably confused with the max-product composition method of recall or infer- 
ence, as mentioned above. This confusion is so widespread it warrants formal clarification. 

In practice, and in the fuzzy control applications developed in Chapters 18 and 19, the
input fuzzy set $A'$ is a binary vector with one 1 and all other elements 0, that is, a row of the
$n$-by-$n$ identity matrix. $A'$ represents the occurrence of the crisp measurement datum $x_i$,
such as a traffic density value of 30. When applied to the encoded FAM rule $(A, B)$, the
measurement value $x_i$ activates $A$ to degree $a_i$. This is part of the max-min composition
recall process, for $A' \circ M = (A' \circ A^T) \circ B = a_i \wedge B$ or $a_i B$ depending on whether
correlation-minimum or correlation-product encoding is used. We activate or fire the
output associant $B$ of the “rule” to degree $a_i$.

Since the input values $a'_i$ are binary, $a'_i \, m_{ij} = a'_i \wedge m_{ij}$. So the max-min and max-product
composition operators coincide. We avoid this confusion by referring to both
the recall process and the correlation encoding scheme as correlation-minimum inference
when correlation-minimum encoding is combined with max-min composition, and
as correlation-product inference when correlation-product encoding is combined with
max-min composition.
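
The coincidence of the two composition operators on binary inputs is easy to confirm numerically. The sketch below is illustrative only: `max_min` and `max_product` are assumed helper names, and the matrix is the correlation-product matrix from the example with $A = (.3\ .4\ .8\ 1)$ and $B = (.8\ .4\ .5)$.

```python
def max_min(a, m):
    """Max-min composition: component j is max_i min(a_i, m_ij)."""
    return [max(min(a[i], m[i][j]) for i in range(len(a)))
            for j in range(len(m[0]))]


def max_product(a, m):
    """Max-product composition: component j is max_i a_i * m_ij."""
    return [max(a[i] * m[i][j] for i in range(len(a)))
            for j in range(len(m[0]))]


M = [[0.24, 0.12, 0.15],
     [0.32, 0.16, 0.20],
     [0.64, 0.32, 0.40],
     [0.80, 0.40, 0.50]]

# The four rows of the 4-by-4 identity matrix are the possible binary inputs.
binary_inputs = [[1 if i == k else 0 for i in range(4)] for k in range(4)]
same_for_binary = all(max_min(x, M) == max_product(x, M) for x in binary_inputs)

# A genuinely fuzzy input separates the two operators.
fuzzy_input = [0.5, 0.0, 0.0, 0.0]
differ_for_fuzzy = max_min(fuzzy_input, M) != max_product(fuzzy_input, M)

print(same_for_binary, differ_for_fuzzy)   # True True
```

The two operators agree on every binary input and disagree on the fuzzy input, which is why the choice of composition operator does not matter in the binary case while the choice of encoding scheme still does.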

We now prove the correlation-product version of the Bidirectional FAM Theorem. 

Correlation-Product Bidirectional FAM Theorem. If $M = A^T B$ and $A$ and $B$
are non-null fit vectors, then

\[
\begin{array}{lll}
\text{(i)}   & A \circ M = B          & \text{iff } H(A) = 1 , \\
\text{(ii)}  & B \circ M^T = A        & \text{iff } H(B) = 1 , \\
\text{(iii)} & A' \circ M \subset B   & \text{for any } A' , \\
\text{(iv)}  & B' \circ M^T \subset A & \text{for any } B' .
\end{array}
\]

Proof.

\[
\begin{aligned}
A \circ M &= A \circ (A^T B) \\
          &= (A \circ A^T) \, B \\
          &= H(A) \, B .
\end{aligned}
\]

Since $B$ is not the empty set, $H(A) \, B = B$ iff $H(A) = 1$, establishing (i). ($A \circ M = B$
holds trivially if $B$ is the empty set.) For an arbitrary fit vector $A'$ in $I^n$:

\[
\begin{aligned}
A' \circ M &= (A' \circ A^T) \, B \\
           &\subset H(A) \, B \\
           &\subset B ,
\end{aligned}
\]

since $A' \circ A^T \le H(A)$, establishing (iii). (ii) and (iv) are proved similarly using
$M^T = B^T A$. Q.E.D.

Superimposing FAM Rules 

Now suppose we have $m$ FAM rules or associations $(A_1, B_1), \ldots, (A_m, B_m)$. The fuzzy
Hebb encoding scheme (6) leads to $m$ FAM matrices $M_1, \ldots, M_m$ to encode the associations.
The natural neural-network temptation is to add, or in this case maximum, the $m$
matrices pointwise to distributively encode the associations in a single matrix $M$:

\[ M = \max_{1 \le k \le m} M_k . \tag{15} \]

This superimposition scheme fails for fuzzy Hebbian encoding. The superimposed result
tends to be the matrix $A^T \circ B$, where $A$ and $B$ are the pointwise maximum of the respective
$m$ fit vectors $A_k$ and $B_k$. We can see this from the pointwise inequality

\[ \max_{1 \le k \le m} \min(a_i^k, b_j^k) \;\le\; \min\Bigl( \max_{1 \le k \le m} a_i^k , \; \max_{1 \le k \le m} b_j^k \Bigr) . \tag{16} \]

Inequality (16) tends to hold with equality as $m$ increases since all maximum terms approach
unity. We lose the information in the $m$ associations $(A_k, B_k)$.
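
Inequality (16) can be illustrated with a small Python sketch. The three FAM rules below are made-up normal fit vectors chosen for illustration, and `corr_min` and `pointwise_max` are assumed helper names.

```python
def corr_min(a, b):
    """Correlation-minimum encoding, m_ij = min(a_i, b_j)."""
    return [[min(ai, bj) for bj in b] for ai in a]


def pointwise_max(matrices):
    """Entrywise maximum of a list of equally sized matrices, as in (15)."""
    return [[max(vals) for vals in zip(*rows)] for rows in zip(*matrices)]


# Three hypothetical FAM rules (A_k, B_k); every fit vector is normal.
rules = [([1.0, 0.2, 0.0], [0.9, 0.1]),
         ([0.1, 1.0, 0.3], [0.2, 1.0]),
         ([0.0, 0.4, 1.0], [1.0, 0.3])]

M_super = pointwise_max([corr_min(a, b) for a, b in rules])   # left side of (16)

A_max = [max(col) for col in zip(*[a for a, _ in rules])]
B_max = [max(col) for col in zip(*[b for _, b in rules])]
M_bound = corr_min(A_max, B_max)                              # right side of (16)

holds = all(M_super[i][j] <= M_bound[i][j]
            for i in range(len(M_super)) for j in range(len(M_super[0])))
print(M_super)   # [[0.9, 0.1], [0.4, 1.0], [1.0, 0.3]]
print(M_bound)   # [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
print(holds)     # True
```

With only three rules the bound is already the all-ones matrix, which carries no rule-specific information. As more rules are superimposed, the entries of `M_super` can only grow toward that bound.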

The fuzzy approach to the superimposition problem is to additively superimpose the $m$
recalled vectors $B'_k$ instead of the fuzzy Hebb matrices $M_k$. $B'_k$ and $M_k$ are given by

\[
\begin{aligned}
A \circ M_k &= A \circ (A_k^T \circ B_k) \\
            &= B'_k ,
\end{aligned}
\]

for any fit-vector input $A$ applied in parallel to the bank of FAM rules $(A_k, B_k)$. This
requires separately storing the $m$ associations $(A_k, B_k)$, as if each association in the FAM
bank were a separate feedforward neural network.

Separate storage of FAM associations is costly but provides an “audit trail” of the
FAM inference procedure. The user can directly determine which FAM rules contributed
how much membership activation to a “concluded” output. Separate storage also provides
knowledge-base modularity. The user can add or delete FAM-structured knowledge
without disturbing stored knowledge. Both of these benefits are advantages over a pure
neural-network architecture for encoding the same associations $(A_k, B_k)$. Of course we can
use neural networks exogenously to estimate, or even individually house, the associations
$(A_k, B_k)$.

Separate storage of FAM rules brings out another distinction between FAM systems
and neural networks. A fit-vector input $A$ activates all the FAM rules $(A_k, B_k)$ in parallel
but to different degrees. If $A$ only partially “satisfies” the antecedent associant $A_k$, the
consequent associant $B_k$ is only partially activated. If $A$ does not satisfy $A_k$ at all, $B_k$ does
not activate at all. $B'_k$ is the null vector.

Neural networks behave differently. They try to reconstruct the entire association
$(A_k, B_k)$ when stimulated with $A$. If $A$ and $A_k$ mismatch severely, a neural network will
tend to emit a non-null output $B'_k$, perhaps the result of the network dynamical system
falling into a “spurious” attractor in the state space. This may be desirable for metrical
classification problems. It is undesirable for inferential problems and, arguably, for associative
memory problems. When we ask an expert a question outside his field of knowledge,
in many cases it is more prudent for him to give no response than to give an educated,
though wild, guess.


Recalled Outputs and “Defuzzification” 

The recalled fit-vector output $B$ is a weighted sum of the individual recalled vectors
$B'_k$:

\[ B = \sum_{k=1}^{m} w_k \, B'_k , \tag{17} \]

where the nonnegative weight $w_k$ summarizes the credibility or strength of the $k$th FAM
rule $(A_k, B_k)$. The credibility weights $w_k$ are immediate candidates for adaptive modification.
In practice we choose $w_1 = \cdots = w_m = 1$ as a default.
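
A hedged Python sketch of the additive combination (17) follows. The rule bank is hypothetical, correlation-minimum encoding is assumed, and the names `recall_bank`, `corr_min`, and `max_min` are illustrative rather than taken from the text.

```python
def corr_min(a, b):
    """Correlation-minimum encoding, m_ij = min(a_i, b_j)."""
    return [[min(ai, bj) for bj in b] for ai in a]


def max_min(a, m):
    """Max-min composition: component j is max_i min(a_i, m_ij)."""
    return [max(min(a[i], m[i][j]) for i in range(len(a)))
            for j in range(len(m[0]))]


def recall_bank(a, rules, weights=None):
    """Apply input a to every stored rule and add the weighted recalled vectors."""
    if weights is None:
        weights = [1.0] * len(rules)            # default w_1 = ... = w_m = 1
    total = [0.0] * len(rules[0][1])
    for w, (ak, bk) in zip(weights, rules):
        recalled = max_min(a, corr_min(ak, bk))             # B'_k
        total = [t + w * r for t, r in zip(total, recalled)]
    return total


# Hypothetical bank of three FAM rules (A_k, B_k), stored separately.
rules = [([1.0, 0.2, 0.0], [0.9, 0.1]),
         ([0.1, 1.0, 0.3], [0.2, 1.0]),
         ([0.0, 0.4, 1.0], [1.0, 0.3])]

B_out = recall_bank([0, 1, 0], rules)
print([round(x, 2) for x in B_out])   # [0.8, 1.4]
```

The second component exceeds 1, so the unnormalized sum need not be a fit vector.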

In principle, though not in practice, the recalled fit-vector output is a normalized sum
of the $B'_k$ fit vectors. This keeps the components of $B$ unit-interval valued. We do not
use normalization in practice because we invariably “defuzzify” the output distribution $B$
to produce a single numerical output, a single value in the output universe of discourse
$Y = \{y_1, \ldots, y_p\}$. The information in the output waveform $B$ resides largely in the
relative values of the membership degrees.

The simplest defuzzification scheme is to choose that element $y_{\max}$ that has maximal
membership in the output fuzzy set $B$:

\[ m_B(y_{\max}) = \max_{1 \le j \le p} m_B(y_j) . \tag{18} \]

The popular probabilistic methods of maximum-likelihood and maximum-a-posteriori parameter
estimation motivate this maximum-membership defuzzification scheme. The
maximum-membership scheme (18) is also computationally light.

There are two fundamental problems with the maximum-membership defuzzification
scheme. First, the mode of the $B$ distribution is not unique. This is especially troublesome
with correlation-minimum encoding, as the representation (8) shows, and somewhat less
troublesome with correlation-product encoding. Since the minimum operator clips off the
top of the $B_k$ fit vectors, the additively combined output fit vector $B$ tends to be flat over
many regions of the universe of discourse $Y$. For continuous membership functions this leads
to infinitely many modes. Even for quantized fuzzy sets, there may be many modes.

In practice we can average multiple modes. For large FAM banks of “independent” 
FAM rules, some form of the Central Limit Theorem (whose proof ultimately depends 
on Fourier transformability not probability) tends to apply. The waveform B tends to 
resemble a Gaussian membership function. So a unique mode tends to emerge. It tends 
to emerge with fewer samples if we use correlation-product encoding. 

Second, the maximum-membership scheme ignores the information in much of the 
waveform B. Again correlation-minimum encoding compounds the problem. In practice 
B is often highly asymmetric, even if it is unimodal. Infinitely many output distributions 
can share the same mode. 

The natural alternative is the fuzzy centroid defuzzification scheme. We directly
compute the real-valued output as a normalized convex combination of fit values, the fuzzy
centroid $\bar{B}$ of fit-vector $B$ with respect to output space $Y$:

\[ \bar{B} = \frac{\displaystyle\sum_{j=1}^{p} y_j \, m_B(y_j)}{\displaystyle\sum_{j=1}^{p} m_B(y_j)} . \tag{19} \]

The fuzzy centroid is unique and uses all the information in the output distribution $B$. For
symmetric unimodal distributions the mode and fuzzy centroid coincide. In many cases
we must replace the discrete sums in (19) with integrals over continuously infinite spaces.
We show in Chapter 19, though, that for libraries of trapezoidal fuzzy sets we can replace
such a ratio of integrals with a ratio of simple discrete sums.
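
The two defuzzification schemes (18) and (19) can be compared on a small discrete example. This is a minimal sketch: the output universe `ys` and waveform `b` are invented for illustration, and averaging tied modes is one possible tie-breaking choice.

```python
def max_membership(ys, b):
    """Scheme (18): return the element of maximal membership, averaging ties."""
    peak = max(b)
    modes = [y for y, m in zip(ys, b) if m == peak]
    return sum(modes) / len(modes)


def centroid(ys, b):
    """Scheme (19): membership-weighted average of the output values."""
    return sum(y * m for y, m in zip(ys, b)) / sum(b)


ys = [-2, -1, 0, 1, 2]             # hypothetical output universe Y
b = [0.1, 0.2, 0.8, 0.8, 0.6]      # flat-topped, asymmetric output waveform B

print(max_membership(ys, b))       # 0.5, the average of the two modes 0 and 1
print(round(centroid(ys, b), 2))   # 0.64
```

The mode-based answer ignores the heavy right tail of `b`, while the centroid shifts toward it.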

Note that computing the centroid (19) is the only step in the FAM inference procedure
that requires division. All other operations are inner products, pairwise minima, and additions.
This promises realization in a fuzzy optical processor. Already some form of this
FAM-inference scheme has led to digital [Togai, 1986] and analog [Yamakawa, 1987-88]
VLSI circuitry.


FAM System Architecture 


Figure 17.3 schematizes the architecture of the nonlinear FAM system $F$. Note that $F$
maps fuzzy sets to fuzzy sets: $F(A) = B$. So $F$ is in fact a fuzzy-system transformation
$F : I^n \to I^p$. In practice $A$ is a bit vector with one unity value, $a_i = 1$, and all other
fit values zero, $a_j = 0$.

The output fuzzy set $B$ is usually defuzzified with the centroid technique to produce an
exact element $y_j$ in the output universe of discourse $Y$. In effect defuzzification produces
an output binary vector $O$, again with one element 1 and the rest 0s. At this level the FAM
system $F$ maps sets to sets, reducing the fuzzy system $F$ to a mapping between Boolean
cubes, $F : \{0,1\}^n \to \{0,1\}^p$. In many applications we model $X$ and $Y$ as continuous
universes of discourse. So $n$ and $p$ are quite large. We shall call such systems binary
input-output FAMs.


FIGURE 17.3 FAM system architecture. The FAM system $F$ maps fuzzy
sets in the unit cube $I^n$ to fuzzy sets in the unit cube $I^p$. Binary input fuzzy
sets are often used in practice to model exact input data. In general only an
uncertainty estimate of the system state is available. So $A$ is a proper fuzzy set.
The user can defuzzify output fuzzy set $B$ to yield exact output data, reducing
the FAM system to a mapping between Boolean cubes.


Binary Input-Output FAMs: Inverted Pendulum Example 

Binary input-output FAMs (BIOFAMs) are the most popular fuzzy systems for appli- 
cations. BIOFAMs map system state- variable data to control data. In the case of traffic 
control, a BIOFAM maps traffic densities to green (and red) light durations. 

BIOFAMs easily extend to multiple FAM rule antecedents, to mappings from product
cubes to product cubes. There has been little theoretical justification for this extension,
aside from Mamdani’s [1977] original suggestion to multiply relational matrices. The extension
to multi-antecedent FAM rules is easier applied than formally explained. In the
next section we present a general explanation for dealing with multi-antecedent FAM rules.
First, though, we present the BIOFAM algorithm by illustrating it, and the FAM construction
procedure, on an archetypical control problem.

Consider an inverted pendulum. In particular, consider how to adjust a motor to bal- 
ance an inverted pendulum in two dimensions. The inverted pendulum is a classical control 
problem. It admits a math-model control solution. This provides a formal benchmark for 
BIOFAM pendulum controllers. 

There are two state variables and one control variable. The first state variable is the
angle $\theta$ that the pendulum shaft makes with the vertical. Zero angle corresponds to the
vertical position. Positive angles are to the right of the vertical, negative angles to the left.

The second state variable is the angular velocity $\Delta\theta$. In practice we approximate the
instantaneous angular velocity $\Delta\theta$ as the difference between the present angle measurement
$\theta_t$ and the previous angle measurement $\theta_{t-1}$:

\[ \Delta\theta_t = \theta_t - \theta_{t-1} . \]

The control variable is the motor current or angular velocity $v$. The velocity can also
be positive or negative. We expect that if the pendulum falls to the right, the motor
velocity should be negative to compensate. If the pendulum falls to the left, the motor
velocity should be positive. If the pendulum successfully balances at the vertical, the motor
velocity should be zero.

The real line $R$ is the universe of discourse of the three variables. In practice we
restrict each universe of discourse to a comparatively small interval, such as $[-90, 90]$ for
the pendulum angle, centered about zero.

We can quantize each universe of discourse into seven overlapping fuzzy sets. We know
that the system variables can be positive, zero, or negative. We can quantize the magnitudes
of the system variables finely or coarsely. Suppose we quantize the magnitudes as
small, medium, and large. This leads to seven linguistic fuzzy set values:

NL: Negative Large

NM: Negative Medium

NS: Negative Small

ZE: Zero

PS: Positive Small

PM: Positive Medium

PL: Positive Large


For example, $\theta$ is a fuzzy variable that takes NL as a fuzzy set value. Different fuzzy
quantizations of the angle universe of discourse allow the fuzzy variable $\theta$ to assume different
fuzzy set values. The expressive power of the FAM approach stems from these fuzzy-set
quantizations. In one stroke we reduce system dimensions, and we describe a nonlinear
numerical process with linguistic common-sense terms.

We are not concerned with the exact shape of the fuzzy sets defined on each of the
three universes of discourse. In practice the quantizing fuzzy sets are usually symmetric
triangles or trapezoids centered about representative values. (We can think of such sets as
fuzzy numbers.) The set ZE may be a Gaussian curve for the pendulum angle $\theta$, a triangle
for the angular velocity $\Delta\theta$, and a trapezoid for the velocity $v$. But all the ZE fuzzy sets
will be centered about the numerical value zero, which will have maximum membership in
the set of zero values.

How much should contiguous fuzzy sets overlap? This design issue depends on the 
problem at hand. Too much overlap blurs the distinction between the fuzzy set values. 
Too little overlap tends to resemble bivalent control, producing overshoot and undershoot. 
In Chapter 19 we determine experimentally the following default heuristic for ideal overlap: 
Contiguous fuzzy sets in a library should overlap approximately 25%. 

FAM rules are triples, such as $(NM, ZE; PM)$. They describe how to modify the control
variable for observed values of the pendulum state variables. A FAM rule associates
a motor-velocity fuzzy set value with a pendulum-angle fuzzy set value and an angular-velocity
fuzzy set value. So we can interpret the triple $(NM, ZE; PM)$ as the set-level
implication


    IF the pendulum angle $\theta$ is negative but medium
    AND the angular velocity $\Delta\theta$ is about zero,
    THEN the motor velocity should be positive but medium.


These commonsensical FAM rules are comparatively easy to articulate in natural language.
Consider a terser linguistic version of the same two-antecedent FAM rule:


    IF $\theta = NM$ AND $\Delta\theta = ZE$,
    THEN $v = PM$.


Even this mild level of formalism may inhibit the knowledge acquisition process. On the
other hand, the still terser FAM triple $(NM, ZE; PM)$ allows knowledge to be acquired
simply by filling in a few entries in a linguistic FAM-bank matrix. In practice this often
allows a working system to be developed in hours, if not minutes.

We specify the pendulum FAM system when we choose a FAM bank of two-antecedent
FAM rules. Perhaps the first FAM rule to choose is the steady-state FAM rule: $(ZE, ZE; ZE)$.
The steady-state FAM rule describes what to do in equilibrium. For the inverted pendulum
we should do nothing.

This is typical of many control problems that require nulling a scalar error measure.
We can control multivariable problems by nulling the norms of the system error vector
and error-velocity vectors, or, better, by directly nulling the individual scalar variables.
(Chapter 19 shows how error nulling can control a realtime target tracking system.) Error
nulling tractably extends the FAM methodology to nonlinear estimation, control, and
decision problems of high dimension.

The pendulum FAM bank is a 7-by-7 matrix with linguistic fuzzy-set entries. We index
the columns by the seven fuzzy sets that quantize the angle $\theta$ universe of discourse. We
index the rows by the seven fuzzy sets that quantize the angular velocity $\Delta\theta$ universe of
discourse.

Each matrix entry is one of seven motor-velocity fuzzy-set values. Since a FAM rule is a
mapping or function, there is exactly one output velocity value for every pair of angle and
angular-velocity values. So the 49 entries in the FAM bank matrix represent the 49 possible
two-antecedent FAM rules. In practice most of the entries are blank. In the adaptive FAM
case discussed below, we adaptively generate the entries from process sample data.

Commonsense dictates the entries in the pendulum FAM bank matrix. Suppose the
pendulum is not changing. So $\Delta\theta = ZE$. If the pendulum is to the right of vertical,
the motor velocity should be negative to compensate. The farther the pendulum is to
the right, the larger the negative motor velocity should be. The motor velocity should
be positive if the pendulum is to the left. So the fourth row of the FAM bank matrix,
which corresponds to $\Delta\theta = ZE$, should be the ordinal inverse of the $\theta$ row values. This
assignment includes the steady-state FAM rule $(ZE, ZE; ZE)$.

Now suppose the angle $\theta$ is zero but the pendulum is moving. If the angular velocity is
negative, the pendulum will overshoot to the left. So the motor velocity should be positive
to compensate. If the angular velocity is positive, the motor velocity should be negative.
The greater the angular velocity is in magnitude, the greater the motor velocity should
be in magnitude. So the fourth column of the FAM bank matrix, which corresponds to
$\theta = ZE$, should be the ordinal inverse of the $\Delta\theta$ column values. This assignment also
includes the steady-state FAM rule.

Positive $\theta$ values with negative $\Delta\theta$ values should produce negative motor velocity values,
since the pendulum is heading toward the vertical. So $(PS, NS; NS)$ is a candidate FAM
rule. Symmetrically, negative $\theta$ values with positive $\Delta\theta$ values should produce positive
motor velocity values. So $(NS, PS; PS)$ is another candidate FAM rule.
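
The rule assignments described so far can be collected in a small lookup table. The Python sketch below is an illustration under stated assumptions: the dictionary layout, the key order (angle label, angular-velocity label), and the helper name `ordinal_inverse` are choices made here, not notation from the text.

```python
LABELS = ["NL", "NM", "NS", "ZE", "PS", "PM", "PL"]


def ordinal_inverse(label):
    """Map NL<->PL, NM<->PM, NS<->PS, and ZE to itself."""
    return LABELS[len(LABELS) - 1 - LABELS.index(label)]


# Keys are (angle label, angular-velocity label); values are motor-velocity labels.
bank = {}
for label in LABELS:
    bank[(label, "ZE")] = ordinal_inverse(label)   # fourth row: angular velocity ZE
    bank[("ZE", label)] = ordinal_inverse(label)   # fourth column: angle ZE
bank[("PS", "NS")] = "NS"   # pendulum right of vertical, heading back toward it
bank[("NS", "PS")] = "PS"   # the symmetric case

print(len(bank))                 # 15
print(bank[("ZE", "ZE")])        # ZE, the steady-state rule
print(bank.get(("PL", "PL")))    # None: a blank entry in the 7-by-7 matrix
```

The row and column assignments share the steady-state rule, so the table holds 7 + 7 - 1 + 2 = 15 rules.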

This gives 15 FAM rules altogether. In practice these rules are more than sufficient to
successfully balance an inverted pendulum. Different, and smaller, subsets of FAM rules
may also successfully balance the pendulum.

We can represent the bank of 15 FAM rules as the 7-by-7 linguistic matrix

\[
\begin{array}{c|ccccccc}
\Delta\theta \,\backslash\, \theta & NL & NM & NS & ZE & PS & PM & PL \\ \hline
NL &    &    &    & PL &    &    &    \\
NM &    &    &    & PM &    &    &    \\
NS &    &    &    & PS & NS &    &    \\
ZE & PL & PM & PS & ZE & NS & NM & NL \\
PS &    &    & PS & NS &    &    &    \\
PM &    &    &    & NM &    &    &    \\
PL &    &    &    & NL &    &    &
\end{array}
\]

The BIOFAM system $F$ also admits a geometric interpretation. The set of all possible
input-output pairs $(\theta, \Delta\theta; F(\theta, \Delta\theta))$ defines a FAM surface in the input-output product space,
in this case in $R^3$. We plot examples of these control surfaces in Chapters 18 and 19.

The BIOFAM inference procedure activates in parallel the antecedents of all 15 FAM 
rules. The binary or pulse nature of inputs picks off single fit values from the quantizing 
fuzzy sets. We can use either the correlation-minimum or correlation-product inferenc- 
ing technique. For simplicity we shall illustrate the procedure with correlation-minimum 
inferencing. 

Suppose the current pendulum angle $\theta$ is 15 degrees and the angular velocity $\Delta\theta$ is
$-10$. This amounts to passing two bit vectors of one 1 and all else 0 through the BIOFAM
system. What is the corresponding motor velocity value $v = F(15, -10)$?

Consider first how the input data pair $(15, -10)$ activates the steady-state FAM rule
$(ZE, ZE; ZE)$. Suppose we define the antecedent and consequent fuzzy sets for ZE with the
triangular fuzzy set membership functions in Figure 17.4. Then the angle datum 15 is a zero
angle value to degree .2: $m_{ZE}^{\theta}(15) = .2$. The angular velocity datum $-10$ is a zero
angular velocity value to degree .5: $m_{ZE}^{\Delta\theta}(-10) = .5$.

We combine the antecedent fit values with minimum or maximum according as the 
antecedent fuzzy sets are combined with the conjunctive AND or the disjunctive OR. 
Intuitively, it should be at least as difficult to satisfy both antecedent conditions as to 
satisfy either one separately. 

The FAM rule notation $(ZE, ZE; ZE)$ implicitly assumes that antecedent fuzzy sets
are combined conjunctively with AND. So the data satisfy the compound antecedent of
the FAM rule $(ZE, ZE; ZE)$ to degree

\[
\begin{aligned}
\min\bigl(m_{ZE}^{\theta}(15), \, m_{ZE}^{\Delta\theta}(-10)\bigr) &= \min(.2, .5) \\
&= .2 .
\end{aligned}
\]

Clearly this methodology extends to any number of antecedent terms connected with arbitrary
logical (set-theoretical) connectives.
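
A minimal Python sketch of this antecedent combination follows. The triangular shapes and their half-widths (18.75 for the angle set, 20 for the angular-velocity set) are assumptions chosen only so that the fit values come out as .2 and .5; they are not read from Figure 17.4.

```python
def triangle(x, center, half_width):
    """Symmetric triangular membership function with peak 1 at `center`."""
    return max(0.0, 1.0 - abs(x - center) / half_width)


# Hypothetical ZE sets for the angle and the angular velocity.
angle_ze = triangle(15, 0.0, 18.75)     # about .2
velocity_ze = triangle(-10, 0.0, 20.0)  # .5

fire_and = min(angle_ze, velocity_ze)   # conjunctive (AND) antecedent
fire_or = max(angle_ze, velocity_ze)    # disjunctive (OR) antecedent

print(round(fire_and, 2), round(fire_or, 2))   # 0.2 0.5
```

The AND degree can never exceed the OR degree, which matches the intuition that satisfying both conditions is at least as hard as satisfying either one.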

The system should now activate the consequent fuzzy set of zero motor velocity values
to degree .2. This is not the same as activating the ZE motor velocity fuzzy set 100% with
probability .2, and certainly not the same as $\mathrm{Prob}\{v = 0\} = .2$. Instead a deterministic
20% of ZE should result and, according to the additive combination formula (17), should
be added to the final output fuzzy set.

The correlation-minimum inference procedure activates the motor velocity fuzzy set
ZE to degree .2 by taking the pairwise minimum of .2 and the ZE fuzzy set $m_{ZE}^{v}$:

\[ \min\bigl(m_{ZE}^{\theta}(15), \, m_{ZE}^{\Delta\theta}(-10)\bigr) \wedge m_{ZE}^{v}(v) = .2 \wedge m_{ZE}^{v}(v) \]

for all velocity values $v$. The correlation-product inference procedure would simply multiply
the zero motor velocity fuzzy set by .2: $.2 \, m_{ZE}^{v}(v)$ for all $v$.

The data similarly activate the FAM rule $(PS, ZE; NS)$ depicted in Figure 17.4. The
angle datum 15 is a small but positive angle value to degree .8. The angular velocity datum
$-10$ is a zero angular velocity value to degree .5. So the output motor velocity fuzzy set of
small but negative motor velocity values is scaled by .5, the lesser of the two antecedent
fit values:

\[ \min\bigl(m_{PS}^{\theta}(15), \, m_{ZE}^{\Delta\theta}(-10)\bigr) \wedge m_{NS}^{v}(v) = .5 \wedge m_{NS}^{v}(v) \]

for all velocity values $v$. So the data activate the FAM rule $(PS, ZE; NS)$ to greater degree
than the steady-state FAM rule $(ZE, ZE; ZE)$ since in this example an angle value of 15
degrees is more a small but positive angle value than a zero angle value.

The data similarly activate the other 13 FAM rules. We combine the resulting minimum- 
scaled consequent fuzzy sets according to (17) by summing pointwise. We can then com- 
pute the fuzzy centroid with equation (19), with perhaps integrals replacing the discrete 
sums, to determine the specific output motor velocity v. In Chapter 19 we show that, for 
symmetric fuzzy sets of quantization, the centroid can always be computed exactly with 
simple discrete sums even if the fuzzy sets are continuous. In many realtime applications 
we must repeat this entire FAM inference procedure hundreds, perhaps thousands, of times 
per second. This requires fuzzy VLSI or optical processors. 

Figure 17.4 illustrates this equal-weight additive combination procedure for just the
FAM rules $(ZE, ZE; ZE)$ and $(PS, ZE; NS)$. The fuzzy-centroidal motor velocity value
in this case is $-3$.

FIGURE 17.4 FAM correlation-minimum inference procedure. The FAM
system consists of the two two-antecedent FAM rules $(PS, ZE; NS)$ and
$(ZE, ZE; ZE)$. The input angle datum is 15, and is more a small but positive
angle value than a zero angle value. The input angular velocity datum
is $-10$, and is only a zero angular velocity value to degree .5. Antecedent fit
values are combined with minimum since the antecedent terms are combined
conjunctively with AND. The combined fit value then scales the consequent
fuzzy set with pairwise minimum. The minimum-scaled output fuzzy sets are
added pointwise. The fuzzy centroid of this output waveform is computed and
yields the system output velocity value $-3$.
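
The two-rule inference procedure can be sketched end to end in Python. Every membership function below is a hypothetical triangle: the angle and angular-velocity sets are chosen to give the fit values .2, .8, and .5 quoted in the text, and the motor-velocity sets (centers 0 and $-10$, half-width 10) are pure assumptions. Because the shapes in Figure 17.4 are not reproduced here, the centroid this sketch prints need not equal the value $-3$ read from the figure.

```python
def triangle(x, center, half_width):
    """Symmetric triangular membership function with peak 1 at `center`."""
    return max(0.0, 1.0 - abs(x - center) / half_width)


# Each rule: (angle set, angular-velocity set, motor-velocity set), all assumed.
RULES = [
    ((0.0, 18.75), (0.0, 20.0), (0.0, 10.0)),      # (ZE, ZE; ZE)
    ((18.75, 18.75), (0.0, 20.0), (-10.0, 10.0)),  # (PS, ZE; NS)
]


def biofam(angle, d_angle, grid, product=False):
    """Fire all rules in parallel, add the activated consequents, take the centroid."""
    out = [0.0] * len(grid)
    for ang_set, vel_set, out_set in RULES:
        degree = min(triangle(angle, *ang_set), triangle(d_angle, *vel_set))
        for n, v in enumerate(grid):
            m = triangle(v, *out_set)
            # Correlation-product scales the consequent; correlation-minimum clips it.
            out[n] += degree * m if product else min(degree, m)
    return sum(v * m for v, m in zip(grid, out)) / sum(out)


grid = [n / 10 for n in range(-300, 301)]   # sampled motor-velocity universe
v_min = biofam(15, -10, grid)
v_prod = biofam(15, -10, grid, product=True)
print(round(v_min, 2), round(v_prod, 2))
```

Both outputs are negative, as expected for a pendulum leaning to the right, and both lie between the centers of the NS and ZE consequent sets because the centroid is a convex combination.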



Multi-Antecedent FAM Rules: Decompositional Inference 

The BIOFAM inference procedure treats antecedent fuzzy sets as if they were propositions
with fuzzy truth values. This is because fuzzy logic corresponds to 1-dimensional
fuzzy set theory and because we use binary or exact inputs. We now formally develop the
connection between BIOFAMs and the FAM theory presented earlier.

Consider the compound FAM rule “IF X is A AND Y is B, THEN Z is C,”
or $(A, B; C)$ for short. Let the universes of discourse $X$, $Y$, and $Z$ have dimensions $n$, $p$,
and $q$: $X = \{x_1, \ldots, x_n\}$, $Y = \{y_1, \ldots, y_p\}$, and $Z = \{z_1, \ldots, z_q\}$. We can directly
extend this framework to multiple antecedent and consequent terms.

In our notation $X$, $Y$, and $Z$ are both universes of discourse and fuzzy variables. The
fuzzy variable $X$ can assume the fuzzy set values $A_1, A_2, \ldots$, and similarly for the fuzzy
variables $Y$ and $Z$. When controlling an inverted pendulum, the identification “X is A”
might represent the natural-language description “The pendulum angle is positive but
small.”

What is the matrix representation of the FAM rule $(A, B; C)$? The question is nontrivial
since $A$, $B$, and $C$ are fuzzy subsets of different universes of discourse, points in different
unit cubes. Their dimensions and interpretations differ. Mamdani [1977] and others have
suggested representing such rules as fuzzy multidimensional relations or arrays. Then the
FAM rule $(A, B; C)$ would be a fuzzy subset of the product space $X \times Y \times Z$. This representation
is not used in practice since only exact inputs are presented to FAM systems
and the BIOFAM procedure applies. If we presented the system with a genuine fuzzy set
input, we would no doubt preprocess the fuzzy set with a centroidal or maximum-fit-value
technique so we could still apply the BIOFAM inference procedure.

We present an alternative representation that decomposes, then recomposes, the FAM
rule $(A, B; C)$ in accord with the FAM inference procedure. This representation allows
neural networks to adaptively estimate, store, and modify the decomposed FAM rules. The
representation requires far less storage than the multidimensional-array representation.

Let the fuzzy Hebb matrices $M_{AC}$ and $M_{BC}$ store the simple FAM associations $(A, C)$
and $(B, C)$:

\[ M_{AC} = A^T \circ C , \tag{20} \]

\[ M_{BC} = B^T \circ C . \tag{21} \]

The fuzzy Hebb matrices $M_{AC}$ and $M_{BC}$ split the compound FAM rule $(A, B; C)$. We can
construct the splitting matrices with correlation-product encoding.

Let $I_X^i = (0 \ldots 0\ 1\ 0 \ldots 0)$ be an $n$-dimensional bit vector with $i$th element 1 and all
other elements 0. $I_X^i$ is the $i$th row of the $n$-by-$n$ identity matrix. Similarly, $I_Y^j$ and $I_Z^k$ are
the respective $j$th and $k$th rows of the $p$-by-$p$ and $q$-by-$q$ identity matrices. The bit vector
$I_X^i$ represents the occurrence of the exact input $x_i$.

We will call the proposed FAM representation scheme FAM decompositional inference,
in the spirit of the max-min compositional inference scheme discussed above. FAM
decompositional inference decomposes the compound FAM rule $(A, B; C)$ into the component
rules $(A, C)$ and $(B, C)$. The simpler component rules are processed in parallel.
New fuzzy set inputs $A'$ and $B'$ pass through the FAM matrices $M_{AC}$ and $M_{BC}$. Max-min
composition then gives the recalled fuzzy sets $C_{A'}$ and $C_{B'}$:

\[ C_{A'} = A' \circ M_{AC} , \tag{22} \]

\[ C_{B'} = B' \circ M_{BC} . \tag{23} \]

The trick is to recompose the fuzzy sets $C_{A'}$ and $C_{B'}$ with intersection or union according
as the antecedent terms “X is A” and “Y is B” are combined with AND or OR. The negated
antecedent term “X is NOT A” requires forming the set complement $C_{A'}^{c}$ for input fuzzy
set $A'$.

Suppose we present the new inputs $A'$ and $B'$ to the single-FAM-rule system $F$ that
stores the FAM rule $(A, B; C)$. Then the recalled output fuzzy set $C'$ equals the intersection
of $C_{A'}$ and $C_{B'}$:

\[
\begin{aligned}
F(A', B') &= [A' \circ M_{AC}] \cap [B' \circ M_{BC}] \\
          &= C_{A'} \cap C_{B'} \\
          &= C' .
\end{aligned}
\tag{24}
\]

We can then defuzzify $C'$, if we wish, to yield the exact output $I_Z^k$.

The logical connectives apply to the antecedent terms of different dimension and meaning.
Decompositional inference applies the set-theoretic analogues of the logical connectives
to subsets of $Z$. Of course all subsets $C'$ of $Z$ have the same dimension and meaning.

We now prove that decompositional inference generalizes BIOFAM inference. This generalization
is not simply formal. It opens an immediate path to adaptation with arbitrary
neural network techniques.

Suppose we present the exact inputs $x_i$ and $y_j$ to the single-FAM-rule system $F$ that
stores $(A, B; C)$. So we present the unit bit vectors $I_X^i$ and $I_Y^j$ to $F$ as nonfuzzy set inputs.
Then

\[ F(x_i, y_j) = F(I_X^i, I_Y^j) = [I_X^i \circ M_{AC}] \cap [I_Y^j \circ M_{BC}] \]

\[ \phantom{F(x_i, y_j)} = a_i \wedge C \;\cap\; b_j \wedge C \tag{25} \]

\[ \phantom{F(x_i, y_j)} = \min(a_i, b_j) \wedge C . \tag{26} \]

(25) follows from (8). Representing $C$ with its membership function $m_C$, (26) is equivalent
to the BIOFAM prescription

\[ \min(a_i, b_j) \wedge m_C(z) \tag{27} \]

for all $z$ in $Z$.
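
The reduction from decompositional inference (24) to the BIOFAM prescription (27) can be checked numerically. In this sketch the fit vectors `A`, `B`, and `C` are invented for illustration, correlation-minimum encoding is assumed for the splitting matrices, and the function names are not from the text.

```python
def corr_min(a, c):
    """Correlation-minimum encoding, m_ij = min(a_i, c_j)."""
    return [[min(ai, cj) for cj in c] for ai in a]


def max_min(a, m):
    """Max-min composition: component j is max_i min(a_i, m_ij)."""
    return [max(min(a[i], m[i][j]) for i in range(len(a)))
            for j in range(len(m[0]))]


def decompositional(a_in, b_in, A, B, C):
    """F(A', B') = [A' o M_AC] intersect [B' o M_BC] for the AND rule (A, B; C)."""
    c_a = max_min(a_in, corr_min(A, C))   # equation (22)
    c_b = max_min(b_in, corr_min(B, C))   # equation (23)
    return [min(x, y) for x, y in zip(c_a, c_b)]


A = [0.2, 0.8, 1.0]         # hypothetical antecedent set on X
B = [0.5, 1.0]              # hypothetical antecedent set on Y
C = [0.3, 1.0, 0.6, 0.1]    # hypothetical consequent set on Z

# Exact inputs x_2 and y_1, presented as unit bit vectors.
exact = decompositional([0, 1, 0], [1, 0], A, B, C)
biofam = [min(min(A[1], B[0]), c) for c in C]    # prescription (27)
print(exact == biofam, exact)   # True [0.3, 0.5, 0.5, 0.1]

# A genuinely fuzzy input pair is handled by the same function.
fuzzy = decompositional([0.5, 0.5, 0.0], [1.0, 1.0], A, B, C)
print(fuzzy)                    # [0.3, 0.5, 0.5, 0.1]
```

For unit bit vectors the decomposed recall agrees entry by entry with clipping $C$ at $\min(a_i, b_j)$.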

If we encode the simple FAM rules $(A, C)$ and $(B, C)$ with correlation-product encoding,
decompositional inference gives the BIOFAM version of correlation-product inference:

\[ F(I_X^i, I_Y^j) = [I_X^i \circ A^T C] \cap [I_Y^j \circ B^T C] \]

\[ \phantom{F(I_X^i, I_Y^j)} = a_i C \cap b_j C \tag{28} \]

\[ \phantom{F(I_X^i, I_Y^j)} = \min(a_i, b_j) \, C \tag{29} \]

\[ \phantom{F(I_X^i, I_Y^j)} = \min(a_i, b_j) \, m_C(z) \tag{30} \]

for all $z$ in $Z$. (13) implies (28). $\min(a_i c_k, b_j c_k) = \min(a_i, b_j) \, c_k$ implies (29).

Decompositional inference allows arbitrary fuzzy sets, waveforms, or distributions $A'$
and $B'$ to be applied to a FAM system. The FAM system can house an arbitrary FAM
bank of compound FAM rules. If we use the FAM system to control a process, the input
fuzzy sets $A'$ and $B'$ can be the output of an independent state-estimation system, such
as a Kalman filter. $A'$ and $B'$ might then represent probability distributions on the exact
input spaces $X$ and $Y$. The filter-controller cascade is a common engineering architecture.

We can split compound consequents as desired. We can split the compound FAM rule
“IF X is A AND Y is B, THEN Z is C OR W is D,” or $(A, B; C, D)$,
into the FAM rules $(A, B; C)$ and $(A, B; D)$. We can use the same split if the consequent
logical connective is AND.

We can give a propositional-calculus justification for the decompositional inference
technique. Let $A$, $B$, and $C$ be bivalent propositions with truth values $t(A)$, $t(B)$, and
$t(C)$ in $\{0, 1\}$. Then we can construct truth tables to prove the two consequent-splitting
tautologies that we use in decompositional inference:

\[ [A \to (B \text{ OR } C)] \;\to\; [(A \to B) \text{ OR } (A \to C)] , \tag{31} \]

\[ [A \to (B \text{ AND } C)] \;\to\; [(A \to B) \text{ AND } (A \to C)] , \tag{32} \]

where the arrow represents logical implication.

In bivalent logic, the implication $A \to B$ is false iff the antecedent $A$ is true and the
consequent $B$ is false. Equivalently, $t(A \to B) = 0$ iff $t(A) = 1$ and $t(B) = 0$.

This allows a “brief” truth table to be constructed to check for validity. We choose truth
values for the terms in the consequent of the overall implication (31) or (32) to make
the consequent false. Given those restrictions, if we cannot find truth values to make the
antecedent true, the statement is a tautology. In (31), if $t((A \to B) \text{ OR } (A \to C)) = 0$,
then $t(A) = 1$ and $t(B) = t(C) = 0$, since a disjunction is false iff both disjuncts are
false. This forces the antecedent $A \to (B \text{ OR } C)$ to be false. So (31) is a tautology: It
is true in all cases.

We can also justify splitting the compound FAM rule “IF X is A OR Y is B,
THEN Z is C” into the disjunction (union) of the two simple FAM rules “IF X is A,
THEN Z is C” and “IF Y is B, THEN Z is C” with a propositional tautology:

$$[(A \text{ OR } B) \to C] \to [(A \to C) \text{ OR } (B \to C)] . \qquad (33)$$

Now consider splitting the original compound FAM rule “IF X is A AND Y is B,
THEN Z is C” into the conjunction (intersection) of the two simple FAM rules “IF X
is A, THEN Z is C” and “IF Y is B, THEN Z is C.” A problem arises when
we examine the truth table of the corresponding proposition

$$[(A \text{ AND } B) \to C] \to [(A \to C) \text{ AND } (B \to C)] . \qquad (34)$$

The problem is that (34) is not always true, and hence not a tautology. The implication
is false if $A$ is true and $B$ and $C$ are false, or if $A$ and $C$ are false and $B$ is true. But the
implication (34) is valid if both antecedent terms $A$ and $B$ are true. So if $t(A) = t(B) = 1$,
the compound conditional $(A \text{ AND } B) \to C$ implies both $A \to C$ and $B \to C$.
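
The truth-table claims about (31) through (34) can be verified by exhaustive enumeration of the eight bivalent assignments. The sketch below is only an illustration; the helper names `implies` and `counterexamples` are ours.

```python
from itertools import product

def implies(p, q):
    """Bivalent implication: false only when p is true and q is false."""
    return (not p) or q

# Each entry encodes one numbered implication from the text.
formulas = {
    31: lambda a, b, c: implies(implies(a, b or c), implies(a, b) or implies(a, c)),
    32: lambda a, b, c: implies(implies(a, b and c), implies(a, b) and implies(a, c)),
    33: lambda a, b, c: implies(implies(a or b, c), implies(a, c) or implies(b, c)),
    34: lambda a, b, c: implies(implies(a and b, c), implies(a, c) and implies(b, c)),
}

def counterexamples(formula):
    """Return every assignment (A, B, C) that makes the formula false."""
    return [v for v in product([False, True], repeat=3) if not formula(*v)]

results = {number: counterexamples(f) for number, f in formulas.items()}
for number, failures in results.items():
    print(number, "tautology" if not failures else failures)
```

Only (34) produces counterexamples, and they are the two assignments named above: $A$ true with $B$ and $C$ false, and $B$ true with $A$ and $C$ false.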

The simultaneous occurrence of the data values $x_i$ and $y_j$ satisfies this condition. Recall
that logic is 1-dimensional set theory. The condition $t(A) = t(B) = 1$ is given by the 1 in
$I_X^i$ and the 1 in $I_Y^j$. We can interpret the unit bit vectors $I_X^i$ and $I_Y^j$ as the (true) bivalent
propositions “X is $x_i$” and “Y is $y_j$.” Propositional logic applies coordinate-wise. A
similar argument holds for the converse of (33).

For general fuzzy set inputs $A'$ and $B'$ the argument still holds in the sense of continuous-valued
logic. But the truth values of the logical implications may be less than unity while
greater than zero. If $A'$ is a null vector and $B'$ is not, or vice versa, the implication (34)
is false coordinate-wise, at least if one coordinate of the non-null vector is unity. But in
this case the decompositional inference scheme yields an output null vector $C'$. In effect
the FAM system indicates the propositional falsehood.

Adaptive Decompositional Inference 


The decompositional inference scheme allows the splitting matrices $M_{AC}$ and $M_{BC}$ to
be arbitrary. Indeed it allows them to be eliminated altogether.

Let $N_X : I^n \to I^q$ be an arbitrary neural network system that maps fuzzy subsets $A'$
of $X$ to fuzzy subsets $C'$ of $Z$. $N_Y : I^p \to I^q$ can be a different neural network. In general
$N_X$ and $N_Y$ are time-varying.

The adaptive decompositional inference (ADI) scheme allows compound FAM rules to
be adaptively split, stored, and modified by arbitrary neural networks. The compound
FAM rule “IF X is A AND Y is B, THEN Z is C,” or $(A, B; C)$, can be split
by $N_X$ and $N_Y$. $N_X$ can house the simple FAM association $(A, C)$. $N_Y$ can house $(B, C)$.
Then for arbitrary fuzzy set inputs $A'$ and $B'$, ADI proceeds as before for an adaptive
FAM system $F : I^n \times I^p \to I^q$ that houses the FAM rule $(A, B; C)$ or a bank of such
FAM rules:

$$F(A', B') = N_X(A') \cap N_Y(B') \qquad (35)$$

$$= C_{A'} \cap C_{B'}$$

$$= C' .$$

Any neural network technique can be used. A reasonable candidate for many un- 
structured problems is the backpropagation algorithm applied to several small feedforward 
multilayer networks. The primary concerns are space and training time. Several small 
neural networks can often be trained in parallel faster, and more accurately, than a single 
large neural network. 
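
The structure of (35) can be sketched in a few lines. The two networks below are untrained, randomly weighted one-hidden-layer sigmoid maps that serve only as hypothetical stand-ins for $N_X$ and $N_Y$; in practice each would be trained (for example with backpropagation) to house its simple FAM association. The dimensions and the name `make_network` are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_network(n_in, n_hidden, n_out):
    """Return an untrained sigmoid network mapping I^n_in to I^n_out (stand-in only)."""
    w1 = rng.normal(size=(n_in, n_hidden))
    w2 = rng.normal(size=(n_hidden, n_out))

    def sigmoid(u):
        return 1.0 / (1.0 + np.exp(-u))

    return lambda fit_vector: sigmoid(sigmoid(fit_vector @ w1) @ w2)

n, p, q = 4, 3, 5                 # hypothetical dimensions of X, Y, and Z
N_X = make_network(n, 6, q)       # would house the association (A, C)
N_Y = make_network(p, 6, q)       # would house the association (B, C)

def adi(a_prime, b_prime):
    """F(A', B') = N_X(A') intersect N_Y(B'), with intersection as pointwise minimum."""
    return np.minimum(N_X(a_prime), N_Y(b_prime))

A_prime = np.array([0.2, 0.9, 0.3, 0.0])
B_prime = np.array([0.7, 1.0, 0.1])
C_prime = adi(A_prime, B_prime)
print(C_prime)
```

Because the sigmoid output lies in the unit interval, the recalled vector is again a fit vector in $I^q$.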

The ADI approach illustrates one way neural algorithms can be embedded in a FAM 
architecture. Below we discuss another way that uses unsupervised clustering algorithms. 


ADAPTIVE FAMs: PRODUCT-SPACE CLUSTERING IN FAM CELLS

An adaptive FAM (AFAM) is a time-varying mapping between fuzzy cubes. In
principle the adaptive decompositional inference technique generates AFAMs. But we
shall reserve the label AFAM for systems that generate FAM rules from training data but
that do not require splitting and recombining FAM data.

We propose a geometric AFAM procedure. The procedure adaptively clusters training 
samples in the FAM system input-output product space. FAM mappings are balls or clusters 
in the input-output product space. These clusters are simply the fuzzy Hebb matrices 
discussed above. The procedure “blindly” generates weighted FAM rules from training 
data. Further training modifies the weighted set of FAM rules. We call this unsupervised 
procedure product-space clustering. 

Consider first a discrete 1-dimensional FAM system $S : I^n \to I^p$. Then a FAM rule
has the form “IF X is $A_i$, THEN Y is $B_i$” or $(A_i, B_i)$. The input-output product
space is $I^n \times I^p$.

What does the FAM rule $(A_i, B_i)$ look like in the product space $I^n \times I^p$? It looks like a
cluster of points centered at the numerical point $(A_i, B_i)$. The FAM system maps points
$A$ near $A_i$ to points $B$ near $B_i$. The closer $A$ is to $A_i$, the closer the point $(A, B)$ is to the
point $(A_i, B_i)$ in the product space $I^n \times I^p$. In this sense FAMs map balls in $I^n$ to balls
in $I^p$. The notation is ambiguous since $(A_i, B_i)$ stands for both the FAM rule mapping,
or fuzzy subset of $I^n \times I^p$, and the numerical fit-vector point in $I^n \times I^p$.

Adaptive clustering algorithms can estimate the unknown FAM rule $(A_i, B_i)$ from training
samples of the form $(A, B)$. In general there are $m$ unknown FAM rules $(A_1, B_1), \ldots,
(A_m, B_m)$. The number $m$ of FAM rules is also unknown. The user may select $m$ arbitrarily
in many applications.

Competitive adaptive vector quantization (AVQ) algorithms can adaptively estimate
both the unknown FAM rules $(A_i, B_i)$ and the unknown number $m$ of FAM rules from
FAM system input-output data. The AVQ algorithms do not require fuzzy-set data. Scalar
BIOFAM data suffices, as we illustrate below for adaptive estimation of inverted-pendulum
control FAM rules.

Suppose the $r$ fuzzy sets $A_1, \ldots, A_r$ quantize the input universe of discourse $X$. The
$s$ fuzzy sets $B_1, \ldots, B_s$ quantize the output universe of discourse $Y$. In general $r$ and $s$
are unrelated to each other and to the number $m$ of FAM rules $(A_i, B_i)$. The user must
specify $r$ and $s$ and the shape of the fuzzy sets $A_i$ and $B_j$. In practice this is not difficult.
Quantizing fuzzy sets are usually trapezoidal, and $r$ and $s$ are less than 10.

The quantizing collections $\{A_i\}$ and $\{B_j\}$ define $rs$ FAM cells $F_{ij}$ in the input-output
product space $I^n \times I^p$. The FAM cells $F_{ij}$ overlap since contiguous quantizing fuzzy sets $A_i$
and $A_{i+1}$, and $B_j$ and $B_{j+1}$, overlap. So the FAM cell collection $\{F_{ij}\}$ does not partition
the product space $I^n \times I^p$. The union of all FAM cells also does not equal $I^n \times I^p$ since
the patches $F_{ij}$ are fuzzy subsets of $I^n \times I^p$. The union provides only a fuzzy “cover” for
$I^n \times I^p$.

The fuzzy Cartesian product $A_i \times B_j$ defines the FAM cell $F_{ij}$. $A_i \times B_j$ is just the
fuzzy outer product $A_i^T \circ B_j$ in (6) or the correlation product $A_i^T B_j$ in (12). So a FAM cell
$F_{ij}$ is simply the fuzzy correlation-minimum or correlation-product matrix $M_{ij}$: $F_{ij} = M_{ij}$.

Adaptive FAM Rule Generation 

Let $m_1, \ldots, m_k$ be $k$ quantization vectors in the input-output product space $I^n \times I^p$
or, equivalently, in $I^{n+p}$. $m_j$ is the $j$th column of the synaptic connection matrix $M$. $M$
has $n + p$ rows and $k$ columns.

Suppose, for instance, $m_j$ changes in time according to the differential competitive
learning (DCL) AVQ algorithm discussed in Chapters 6 and 9. The competitive system
samples concatenated fuzzy set samples of the form $[A|B]$. The augmented fuzzy set $[A|B]$
is a point in the unit hypercube $I^{n+p}$.

The synaptic vectors $m_j$ converge to FAM matrix centroids in $I^n \times I^p$. More generally
they estimate the density or distribution of the FAM rules in $I^n \times I^p$. The quantizing
synaptic vectors naturally weight the estimated FAM rule. The more synaptic vectors
clustered about a centroidal FAM rule, the greater its weight $w_i$ in (17).
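
The convergence of synaptic vectors to cluster centroids can be illustrated with a simpler algorithm than DCL. The sketch below uses plain winner-take-all competitive learning with a constant learning rate, not the differential competitive learning law of Chapters 6 and 9. The cluster centers, noise level, learning rate, and initial synaptic vectors are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training data: two clusters of augmented samples [A|B] in I^2.
centers = np.array([[0.2, 0.2], [0.8, 0.8]])
samples = np.vstack([c + rng.uniform(-0.03, 0.03, size=(200, 2)) for c in centers])
rng.shuffle(samples)  # present the samples in random order

# Two synaptic vectors, stored as rows here (the text stores them as columns of M).
m = np.array([[0.0, 0.0], [1.0, 1.0]])
wins = np.zeros(len(m), dtype=int)
rate = 0.1  # constant rate: recent samples weigh more heavily than old ones

for x in samples:
    winner = np.argmin(np.linalg.norm(m - x, axis=1))  # metrical classification
    m[winner] += rate * (x - m[winner])                # only the winner learns
    wins[winner] += 1

print(m)     # each row ends near one cluster centroid
print(wins)  # number of samples each synaptic vector won
```

With many synaptic vectors per cluster, counting how many settle near each centroid gives the cluster counts used to weight the estimated FAM rules.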

Suppose there are 15 FAM-rule centroids in $I^n \times I^p$ and $k > 15$. Suppose $k_i$ synaptic
vectors $m_j$ cluster around the $i$th centroid. So $k_1 + \cdots + k_{15} = k$. Suppose the cluster
counts $k_i$ are ordered as

$$k_1 \ge k_2 \ge \cdots \ge k_{15} . \qquad (36)$$

The first centroidal FAM rule is at least as frequent as the second centroidal FAM
rule, and so on. This gives the adaptive FAM-rule weighting scheme

$$w_i = \frac{k_i}{k} . \qquad (37)$$

The FAM rule weights $w_i$ evolve in time as new augmented fuzzy sets $[A|B]$ are sampled.
In practice we may want only the 15 most-frequent FAM rules or only the FAM rules with
at least some minimum frequency $w_{\min}$. Then (37) provides a quantitative solution.
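
As a small worked instance of the weighting scheme (37), the sketch below normalizes a list of cluster counts and keeps only the rules that meet a minimum frequency. The counts and the threshold are hypothetical.

```python
# Hypothetical cluster counts k_i, already ordered from largest to smallest.
counts = [40, 25, 15, 10, 6, 4]
k = sum(counts)                          # total number of synaptic vectors

weights = [k_i / k for k_i in counts]    # w_i = k_i / k

w_min = 0.1                              # hypothetical minimum frequency
kept = [(i, w) for i, w in enumerate(weights, start=1) if w >= w_min]

print(weights)
print(kept)  # rule indices (1-based) and weights that survive the threshold
```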

Geometrically we count the number $k_{ij}$ of quantizing vectors in each FAM cell $F_{ij}$. We
can define FAM-cell boundaries in advance. High-count FAM cells outrank low-count FAM
cells. Most FAM cells contain zero or few synaptic vectors.

Product-space clustering extends to compound FAM rules and product spaces. The
FAM rule “IF X is A AND Y is B, THEN Z is C,” or $(A, B; C)$, is a point in
$I^n \times I^p \times I^q$. The $t$ fuzzy sets $C_1, \ldots, C_t$ quantize the new output space $Z$. There are
$rst$ FAM cells $F_{ijk}$. (36) and (37) extend similarly. $X$, $Y$, and $Z$ can be continuous. The
adaptive clustering procedure extends to any number of FAM-rule antecedent terms.

Adaptive BIOFAM Clustering 

BIOFAM data clusters more efficiently than fuzzy-set FAM data. Paired numbers are 
easier to process and obtain than paired fit vectors. This allows system input-output data 
to directly generate FAM systems. 

In control applications, human or automatic controllers generate streams of “well- 
controlled” system input-output data. Adaptive BIOFAM clustering converts this data 
to weighted FAM rules. The adaptive system transduces behavioral data to behavioral 
rules. The fuzzy system learns causal patterns. It learns which control inputs cause which 
control outputs. The system approximates these causal patterns when it acts as the controller.

Adaptive BIOFAMs cluster in the input-output product space $X \times Y$. The product
space $X \times Y$ is vastly smaller than the power-set product space $I^n \times I^p$ used above. The
adaptive synaptic vectors $m_j$ are now 2-dimensional instead of $(n + p)$-dimensional. On
the other hand, competitive BIOFAM clustering requires many more input-output data
pairs $(x_i, y_i) \in R^2$ than augmented fuzzy-set samples $[A|B] \in I^{n+p}$.

Again our notation is ambiguous. We now use $x_i$ as the numerical sample from $X$
at sample time $i$. Earlier $x_i$ denoted the $i$th ordered element in the finite nonfuzzy set
$X = \{x_1, \ldots, x_n\}$. One advantage is $X$ can be continuous, say $R^n$.

BIOFAM clustering counts synaptic quantization vectors in FAM cells. The system
samples the nonfuzzy input-output stream $(x_1, y_1), (x_2, y_2), \ldots$ Unsupervised competitive
learning distributes the $k$ synaptic quantization vectors $m_1, \ldots, m_k$ in $X \times Y$. Learning
distributes them to different FAM cells $F_{ij}$. The FAM cells $F_{ij}$ overlap but are nonfuzzy
subcubes of $X \times Y$. The BIOFAM FAM cells $F_{ij}$ cover $X \times Y$.

$F_{ij}$ contains $k_{ij}$ quantization vectors at each sample time. The cell counts $k_{ij}$ define a
frequency histogram since all $k_{ij}$ sum to $k$. So $w_{ij} = k_{ij}/k$ weights the FAM rule “IF X is
$A_i$, THEN Y is $B_j$.”

Suppose the pairwise-overlapping fuzzy sets NL, NM, NS, ZE, PS, PM, PL quantize
the input space $X$. Suppose seven similar fuzzy sets quantize the output space $Y$. We
can define the fuzzy sets arbitrarily. In practice they are normal and trapezoidal. (The
boundary fuzzy sets NL and PL are ramp functions.) $X$ and $Y$ may each be the real line.
A typical FAM rule is “IF X is NL, THEN Y is PS.”

Input datum $x_i$ is nonfuzzy. When $X = x_i$ holds, the relations $X = NL, \ldots, X = PL$
hold to different degrees. Most hold to degree zero. $X = NM$ holds to degree $m_{NM}(x_i)$.
Input datum $x_i$ partially activates the FAM rule “IF X is NM, THEN Y is ZE” or,
equivalently, (NM; ZE). Since the FAM rules have single antecedents, $x_i$ activates the
consequent fuzzy set ZE to degree $m_{NM}(x_i)$ as well. Multi-antecedent FAM rules activate
output consequent sets according to a logic-based function of antecedent term membership
values, as discussed above on BIOFAM inference.
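
The partial activation of a single-antecedent rule can be made concrete with a trapezoidal membership function. The breakpoints below are hypothetical and chosen only so that neighboring sets overlap; the function name `trapezoid` is ours.

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership: 0 outside (a, d), 1 on [b, c], linear ramps between."""
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x <= c:
        return 1.0
    return (d - x) / (d - c)

# Hypothetical breakpoints (a, b, c, d) for three of the seven input fuzzy sets.
SETS = {
    "NM": (-30, -22, -18, -10),
    "NS": (-20, -12, -8, 0),
    "ZE": (-10, -2, 2, 10),
}

x_i = -14.0  # nonfuzzy input datum
degrees = {name: trapezoid(x_i, *points) for name, points in SETS.items()}

# The rule (NM; ZE) activates its consequent ZE to the degree that X = NM holds.
activation = degrees["NM"]
print(degrees, activation)
```

Here the datum belongs to NM and NS to nonzero degrees and to ZE to degree zero, so only rules with antecedent NM or NS fire.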

Suppose Figure 17.5 represents the input-output data stream $(x_1, y_1), (x_2, y_2), \ldots$ in the
planar product space $X \times Y$:

[Figure 17.5 axis labels: NL NM NS ZE PS PM PL]

FIGURE 17.5 Distribution of input-output data $(x_i, y_i)$ in the input-output
product space $X \times Y$. Data clusters reflect FAM rules, such as the steady-state
FAM rule “IF X is ZE, THEN Y is ZE.”

Suppose the sample data in Figure 17.5 trains a DCL system. Suppose such competitive
learning distributes ten 2-dimensional synaptic vectors $m_1, \ldots, m_{10}$ as in Figure 17.6:

FIGURE 17.6 Distribution of ten 2-dimensional synaptic quantization vectors
$m_1, \ldots, m_{10}$ in the input-output product space $X \times Y$. As the FAM system
samples nonfuzzy data $(x_i, y_i)$, competitive learning distributes the synaptic
vectors in $X \times Y$. The synaptic vectors estimate the frequency distribution of
the sampled input-output data, and thus estimate FAM rules.


FAM cells do not overlap in Figures 17.5 and 17.6 for convenience’s sake. The corre- 
sponding quantizing fuzzy sets touch but do not overlap. 

Figure 17.5 reveals six sample-data clusters. The six quantization-vector clusters in
Figure 17.6 estimate the six sample-data clusters. The single synaptic vector in FAM cell
(PM; NS) indicates a smaller cluster. Since $k = 10$, the number of quantization vectors
in each FAM cell measures the percentage or frequency weight $w_{ij}$ of each possible FAM
rule.

In general the additive combination rule (17) does not require normalizing the
quantization-vector count $k_{ij}$. $w_{ij} = k_{ij}$ is acceptable. This holds for both maximum-membership
defuzzification (18) and fuzzy centroid defuzzification (19). These defuzzification schemes
prohibit only negative weight values.

The ten quantization vectors in Figure 17.6 estimate at most six FAM rules. From most
to least frequent or “important”, the FAM rules are (ZE; ZE), (PS; NS), (NS; PS),
(PM; NS), (PL; NL), and (NL; PL). These FAM rules suggest that fuzzy variable X is
an error variable or an error velocity variable since the steady-state FAM rule (ZE; ZE) is
most important. If we sample a system only in steady-state equilibrium, we will estimate
only the steady-state FAM rule. We can accurately estimate the FAM system’s global
behavior only if we representatively sample the system’s input-output behavior.

The “corner” FAM rules (PL; NL) and (NL; PL) may be more important than their
frequencies suggest. The boundary sets Negative Large (NL) and Positive Large (PL)
are usually defined as ramp functions, as negatively and positively sloped lines. NL and
PL alone cover the important end-point regions of the universe of discourse $X$. They give
$m_{NL}(x) = m_{PL}(x) = 1$ only if $x$ is at or near the end-point of $X$, since NL and PL are
ramp functions, not trapezoids. NL and PL cover these end-point regions “briefly”. Their
corresponding FAM cells tend to be smaller than the other FAM cells. The end-point
regions must be covered in most control problems, especially error nulling problems like
stabilizing an inverted pendulum. The user can weight these FAM-cell counts more highly,
for instance as $c\,k_{ij}$ for scaling constant $c > 0$. Or the user can simply include these
end-point FAM rules in every operative FAM bank.

Most FAM cells do not generate FAM rules. More accurately, we estimate every possible
FAM rule but usually with zero or near-zero frequency weight $w_{ij}$. For large numbers of
multiple FAM-rule antecedents, system input-output data streams through comparatively
few FAM cells. Structured trajectories in $X \times Y$ are few.

A FAM-rule’s mapping structure also limits the number of estimated FAM rules. A
FAM rule maps fuzzy sets in $I^n$ or $F(2^X)$ to fuzzy sets in $I^p$ or $F(2^Y)$. A fuzzy associative
memory maps every domain fuzzy set $A$ to a unique range fuzzy set $B$. Fuzzy set $A$ cannot
map to multiple fuzzy sets $B$, $B'$, $B''$, and so on. We write the FAM rule as $(A; B)$ not
$(A; B \text{ or } B' \text{ or } B'' \text{ or } \ldots)$. So we estimate at most one rule per FAM-cell row in Figure
17.6.

If two FAM cells in a row are equally and highly frequent, we can pick arbitrarily either 
FAM rule to include in the FAM bank. This occurs infrequently but can occur. In principle 
we could estimate the FAM rule as a compound FAM rule with a disjunctive consequent. 
The simplest strategy picks only the highest frequency FAM cell per row. 
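
The simplest strategy can be sketched as a per-antecedent maximum over the cell counts. The counts below are hypothetical (they are not read off Figure 17.6), and we assume that a "row" collects all cells that share one antecedent fuzzy set.

```python
# Hypothetical counts k_ij keyed by (antecedent, consequent); they sum to k = 10.
counts = {
    ("ZE", "ZE"): 3,
    ("PS", "NS"): 2,
    ("PS", "ZE"): 1,   # competes with (PS; NS) in the same row
    ("NS", "PS"): 2,
    ("PL", "NL"): 1,
    ("NL", "PL"): 1,
}

best = {}  # antecedent -> (consequent, count) of the most frequent cell so far
for (antecedent, consequent), k_ij in counts.items():
    # A strict comparison keeps the first cell seen on a tie (an arbitrary pick).
    if k_ij > best.get(antecedent, (None, 0))[1]:
        best[antecedent] = (consequent, k_ij)

fam_bank = {antecedent: consequent for antecedent, (consequent, _) in best.items()}
print(fam_bank)
```

Each antecedent keeps exactly one consequent, so the estimated FAM bank never maps one fuzzy set to several outputs.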

The user can estimate FAM rules without counting the quantization vectors in each
FAM cell. There may be too many FAM cells to search at each estimation iteration.
The user never need examine FAM cells. Instead the user checks the synaptic vector
components $m_{ij}$. The user defines in advance fuzzy-set intervals, such as $[l_{NL}, u_{NL}]$ for
NL. If $l_{NL} < m_{ij} < u_{NL}$, then the FAM-antecedent reads “IF X is NL.”

Suppose the input and output spaces $X$ and $Y$ are the same, the real interval $[-35, 35]$.
Suppose we partition $X$ and $Y$ into the same seven disjoint fuzzy sets:

NL = [-35, -25]
NM = [-25, -15]
NS = [-15, -5]
ZE = [-5, 5]
PS = [5, 15]
PM = [15, 25]
PL = [25, 35] .

Then the observed synaptic vector $m_j = [9, -10]$ increases the count of FAM cell
PS × NS and increases the weight of FAM rule “IF X is PS, THEN Y is NS.”

This amounts to nearest-neighbor classification of synaptic quantization vectors. We
assign quantization vector $m_k$ to FAM cell $F_{ij}$ iff $m_k$ is closer to the centroid of $F_{ij}$ than
to all other FAM-cell centroids. We break ties arbitrarily. Centroid classification allows
the FAM cells to overlap.
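
The nearest-centroid assignment can be sketched for the seven-interval partition of $[-35, 35]$ given above. Because each FAM cell is a product of intervals, the nearest cell centroid in Euclidean distance is found coordinate by coordinate. The second and third synaptic vectors and the helper names are hypothetical.

```python
from collections import Counter

LABELS = ["NL", "NM", "NS", "ZE", "PS", "PM", "PL"]
CENTROIDS = [-30, -20, -10, 0, 10, 20, 30]  # midpoints of the seven intervals

def nearest_label(value):
    """Label of the interval whose midpoint is closest (ties go to the lower set)."""
    index = min(range(len(CENTROIDS)), key=lambda n: abs(value - CENTROIDS[n]))
    return LABELS[index]

def fam_cell(m):
    """Assign a 2-dimensional synaptic vector to its FAM cell (antecedent, consequent)."""
    return tuple(nearest_label(component) for component in m)

# The first vector is the example from the text; the others are hypothetical.
synaptic_vectors = [[9, -10], [1, -2], [11, -7]]
cell_counts = Counter(fam_cell(m) for m in synaptic_vectors)

print(fam_cell([9, -10]))  # the cell PS x NS
print(cell_counts)
```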

Adaptive BIOFAM Example: Inverted Pendulum


We used DCL to train an AFAM to control the inverted pendulum discussed above.
We used the accompanying C-software to generate 1,000 pendulum-trajectory data vectors. These
product-space training vectors $(\theta, \Delta\theta, v)$ were points in $R^3$. Pendulum angle $\theta$ data
ranged between -90 and 90. Pendulum angular velocity $\Delta\theta$ data ranged from -150 to
150.

We defined FAM cells by uniformly partitioning the effective product space. Fuzzy
variables could assume only the five fuzzy set values NM, NS, ZE, PS, and PM. So
there were 125 possible FAM rules. For instance, the steady-state FAM rule took the form
(ZE, ZE; ZE) or, more completely, “IF $\theta$ = ZE AND $\Delta\theta$ = ZE, THEN $v$ = ZE.”

A BIOFAM controlled the inverted pendulum. The BIOFAM restored the pendulum
to equilibrium as we knocked it over to the right and to the left. (Function keys F9 and
F10 knock the pendulum over to the left and to the right. Input-output sample data
reads automatically to a training data file.) Eleven FAM rules described the BIOFAM
controller. Figure 17.1 displays this FAM bank. Observe that the zero (ZE) row and
column are ordinal inverses of the respective row and column indices.

BIOFAM generated 1,000 sample vectors of the form $(\theta, \Delta\theta, v)$.


We trained 125 3-dimensional synaptic quantization vectors with differential competitive
learning, as discussed in Chapters 4, 6, and 9. In principle the 125 synaptic vectors
could describe a uniform distribution of product-space trajectory data. Then the 125
FAM cells would each contain one synaptic vector. Alternatively, if we used a vertically
stabilized pendulum to generate the 1,000 training vectors, all 125 synaptic vectors would
concentrate in the (ZE, ZE; ZE) FAM cell. This would still be true if we only mildly
perturbed the pendulum from vertical equilibrium.

DCL distributed the 125 synaptic vectors to 13 FAM cells. So we estimated 13 FAM
rules. Some FAM cells contained more synaptic vectors than others. Figure 17.8 displays
the synaptic-vector histogram after DCL processes the 1,000 samples. Actually Figure
17.8 displays a truncated histogram. The horizontal axis should list all 125 FAM cells,
all 125 FAM-rule weights $w_k$ in (17). The missing 112 entries have zero synaptic-vector
frequency.

Figure 17.8 gives a snapshot of the adaptive process. In practice, and in principle,
successive data gradually modify the histogram. “Good” training samples should include
a significant number of equilibrium samples. In Figure 17.8 the steady-state FAM cell
(ZE, ZE; ZE) is clearly the most frequent.

FIGURE 17.8 Synaptic-vector histogram. Differential competitive learning
allocated 125 3-dimensional synaptic vectors to the 125 FAM cells. Here
the adaptive system has sampled 1,000 representative pendulum-control data.
DCL allocates the synaptic vectors to only 13 FAM cells. The steady-state
FAM cell (ZE, ZE; ZE) is most frequent.


Figure 17.9 displays the DCL-estimated FAM bank. The product-space clustering
method rapidly recovered the 11 original FAM rules. It also estimated the two additional
FAM rules (PS, NM; ZE) and (NS, PM; ZE), which did not affect the BIOFAM
system’s performance. The estimated FAM bank defined a BIOFAM, with all 13 FAM-rule
weights $w_k$ set equal to unity, that controlled the pendulum as well as the original
BIOFAM did.

FIGURE 17.9 DCL-estimated FAM bank. Product-space clustering recovered
the original 11 FAM rules and estimated two new FAM rules. The new
and original BIOFAM systems controlled the inverted pendulum equally well.

In nonrealtime applications we can in principle omit the adaptive step altogether. We 
can directly compute the FAM-cell histogram if we exhaustively count all sampled data. 
Then the (growing) number of synaptic vectors equals the number of training samples. This 
procedure equally weights all samples, and so tends not to “track” an evolving process. 
Competitive learning weights more recent samples more heavily. Competitive learning’s 
metrical-classification step also helps filter noise from the stream of sample data. 

PROBLEMS


1. Use correlation-minimum encoding to construct the FAM matrix $M$ from the fit
vector pair $(A, B)$ if A = (.6 1 .2 .9) and B = (.8 .3 1). Is $(A, B)$ a bidirectional
fixed point? Pass A' = (.2 .9 .3 .2) through $M$ and B' = (.9 .5 1) through $M^T$.
Do the recalled fuzzy sets differ from $B$ and $A$?

2. Repeat Problem 1 using correlation-product encoding.


3. Compute the fuzzy entropy E(M) of M in Problems 1 and 2. 

4. If $M = A^T \circ B$ in Problem 1, find a different FAM matrix $M'$ with greater fuzzy
entropy, $E(M') > E(M)$, but that still gives perfect recall: $A \circ M' = B$.
Find the maximum entropy fuzzy associative memory (MEFAM) matrix $M^*$ such
that $A \circ M^* = B$.

5. Prove: If $M = A^T \circ B$ or $M = A^T B$, $A \circ M = B$, and $A \subset A'$, then
$A' \circ M = B$.

6. Prove: $\max_{1 \le k \le m} \min(a_k, b_k) \le \min\left( \max_{1 \le k \le m} a_k, \; \max_{1 \le k \le m} b_k \right)$.

7. Use truth tables to prove the two-valued propositional tautologies:

(a) $[A \to (B \text{ OR } C)] \to [(A \to B) \text{ OR } (A \to C)]$

(b) $[A \to (B \text{ AND } C)] \to [(A \to B) \text{ AND } (A \to C)]$

(c) $[(A \text{ OR } B) \to C] \to [(A \to C) \text{ OR } (B \to C)]$

(d) $[(A \to C) \text{ AND } (B \to C)] \to [(A \text{ AND } B) \to C]$.

Is the converse of (c) a tautology? Explain whether this affects BIOFAM inference.

8. BIOFAM inference. Suppose the input spaces $X$ and $Y$ are both [-10, 10], and the
output space $Z$ is [-100, 100]. Define five trapezoidal fuzzy sets NL, NS, ZE, PS, PL
on $X$, $Y$, and $Z$. Suppose the underlying (unknown) system transfer function is
$z = x^2 - y^2$. State at least five FAM rules that accurately describe the system’s
behavior. Use $z = x^2 - y^2$ to generate streams of sample data. Use BIOFAM inference
and fuzzy-centroid defuzzification to map input pairs $(x, y)$ to output data $z$.
Plot the BIOFAM outputs and the desired outputs $z$. What is the arithmetic average
of the squared errors $(F(x, y) - x^2 + y^2)^2$? Divide the product space $X \times Y \times Z$
into 125 overlapping FAM cells. Estimate FAM rules from clustered system data
$(x, y, z)$. Use these FAM rules to control the system. Evaluate the performance.


Software Problems 

The following problems use the accompanying FAM software for controlling an inverted 
pendulum. 

1. Explain why the pendulum stabilizes in the diagonal position if the pendulum bob 
mass increases to maximum and the motor current decreases slightly. The pendulum 
stabilizes in the vertical position if you remove which FAM rules? 

2. Oscillation results if you remove which FAM rules? The pendulum sticks in a hori- 
zontal equilibrium if you remove which FAM rules? 

ABSTRACT


We discuss recent theorems proving that artificial neural networks are
capable of approximating an arbitrary mapping and its derivatives as
accurately as desired. This fact forms the basis for further results
establishing the learnability of the desired approximations, using results
from non-parametric statistics. These results have potential applications in
robotics, chaotic dynamics, control, and sensitivity analysis (physics,
chemistry, and engineering). We discuss an example involving learning the
transfer function and its derivatives for a chaotic map.


Jordan (1989), "Generic Constraints on Underspecified Target Trajectories,"
Proceedings IJCNN, Washington D.C.:

The Jacobian matrix dz/dx ... is the matrix that relates small changes in the
controller output to small changes in the task space results and cannot be
assumed to be available a priori, or provided by the environment. However,
all of the derivatives in the matrix are forward derivatives. They are easily
obtained by differentiation if a forward model is available. The forward
model itself must be learned, but this can be achieved directly by system
identification. Once the model is accurate over a particular domain, its
derivatives provide a learning operator that allows the system to convert
errors in task space into errors in articulatory space and thereby change the
controller.

ABSTRACT


We give conditions ensuring that multilayer feedforward networks with as few as a
single hidden layer and an appropriately smooth hidden layer activation function are
capable of arbitrarily accurate approximation to an arbitrary function and its derivatives.
In fact, these networks can approximate functions that are not differentiable in the
classical sense, but possess only a generalized derivative, as is the case for certain
piecewise differentiable functions. The conditions imposed on the hidden layer
activation function are relatively mild; the conditions imposed on the domain of the
function to be approximated have practical implications. Our approximation results
provide a previously missing theoretical justification for the use of multilayer
feedforward networks in applications requiring simultaneous approximation of a function
and its derivatives.

Relevant Application Areas:

1. Robotics
2. Chaotic Dynamics
3. Control
4. Sensitivity Analysis (Physics, Chemistry, Engineering)

Intuition suggests that networks having smooth hidden layer activation functions
ought to have output function derivatives that will approximate the derivatives of an
unknown mapping. However, the justification for this intuition is not obvious. Consider
the class of single hidden layer feedforward networks having network output functions
belonging to the set

$$\Sigma(G) = \Big\{\, g : \mathbb{R}^r \to \mathbb{R} \;\Big|\; g(x) = \sum_{j=1}^{q} \beta_j G(\tilde{x}^T \gamma_j);\; x \in \mathbb{R}^r,\ \beta_j \in \mathbb{R},\ \gamma_j \in \mathbb{R}^{r+1},\ j = 1, \ldots, q,\ q \in \mathbb{N} \,\Big\},$$

where $x$ represents an $r$ vector of network inputs ($r \in \mathbb{N} = \{1, 2, \ldots\}$), $\tilde{x} = (1, x^T)^T$
(the superscript $T$ denotes transposition), $\beta_j$ represents hidden to output layer weights
and $\gamma_j$ represents input to hidden layer weights, $j = 1, \ldots, q$, where $q$ is the number of
hidden units, and $G$ is a given hidden unit activation function. The first partial
derivatives of the network output function are given by

$$\partial g(x) / \partial x_i = \sum_{j=1}^{q} \beta_j \gamma_{ji} \, DG(\tilde{x}^T \gamma_j), \quad i = 1, \ldots, r,$$

where $x_i$ is the $i$th component of $x$, $\gamma_{ji}$ is the $i$th component of $\gamma_j$, $i = 1, \ldots, r$ ($\gamma_{j0}$ is the
input layer bias to hidden unit $j$), and $DG$ denotes the first derivative of $G$.
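
The derivative formula can be checked against finite differences. The sketch below assumes a logistic activation for $G$ (any smooth activation would do) and uses random weights; the function names and dimensions are hypothetical.

```python
import numpy as np

def G(u):
    """Logistic hidden-unit activation (an assumed choice of G)."""
    return 1.0 / (1.0 + np.exp(-u))

def DG(u):
    """First derivative of the logistic activation."""
    s = G(u)
    return s * (1.0 - s)

def g(x, beta, gamma):
    """Network output: sum_j beta_j * G(x_tilde^T gamma_j), with x_tilde = (1, x)."""
    x_tilde = np.concatenate(([1.0], x))
    return float(beta @ G(gamma @ x_tilde))

def grad_g(x, beta, gamma):
    """Partial derivatives: sum_j beta_j * gamma_ji * DG(x_tilde^T gamma_j), i = 1..r."""
    x_tilde = np.concatenate(([1.0], x))
    # gamma[:, 1:] drops the bias column gamma_j0, which has no matching input.
    return (beta * DG(gamma @ x_tilde)) @ gamma[:, 1:]

rng = np.random.default_rng(1)
r, q = 3, 4                            # inputs and hidden units
beta = rng.normal(size=q)              # hidden-to-output weights
gamma = rng.normal(size=(q, r + 1))    # input-to-hidden weights, bias in column 0
x = rng.normal(size=r)

analytic = grad_g(x, beta, gamma)
eps = 1e-6
numeric = np.array([
    (g(x + eps * e, beta, gamma) - g(x - eps * e, beta, gamma)) / (2 * eps)
    for e in np.eye(r)
])
print(analytic, numeric)
```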

Single Hidden Layer Feedforward Network

1. Mathematical Background

2. Approximation Results

4. Example: Learning Chaotic Map

1. MATHEMATICAL BACKGROUND


Let $U$ be an open subset of $\mathbb{R}^r$, and let $C(U)$ be the set of all functions continuous
on $U$. Let $\alpha$ be an $r$-tuple $\alpha = (\alpha_1, \ldots, \alpha_r)^T$ of non-negative integers (a "multi-index").
If $x$ belongs to $\mathbb{R}^r$, let $x^\alpha = x_1^{\alpha_1} \cdot \ldots \cdot x_r^{\alpha_r}$. Denote by $D^\alpha$ the partial derivative

$$\partial^{|\alpha|} / \partial x^\alpha = \partial^{|\alpha|} / (\partial x_1^{\alpha_1} \, \partial x_2^{\alpha_2} \cdots \partial x_r^{\alpha_r})$$

of order $|\alpha| = \alpha_1 + \alpha_2 + \cdots + \alpha_r$. For non-negative integers $m$, we define
$C^m(U) = \{ f \in C(U) : D^\alpha f \in C(U) \text{ for all } \alpha, |\alpha| \le m \}$ and $C^\infty(U) = \bigcap_{m \ge 1} C^m(U)$.
We let $D^0$ be the identity, so that $C^0(U) = C(U)$. Thus, the functions in $C^m(U)$ have
continuous derivatives up to order $m$ on $U$, while the functions in $C^\infty(U)$ have
continuous derivatives on $U$ of every order. We shall be interested in approximating
elements of $C^m(U)$ using feedforward networks. When $U \ne \mathbb{R}^r$, the fact that network
output functions (elements of $\Sigma(G)$) will belong to $C^m(\mathbb{R}^r)$ necessitates considering
their restriction to $U$, written $g|_U$ for $g$ in $\Sigma(G)$. (Recall that $g|_U(x) = g(x)$ for $x$ in $U$
and is not defined for $x$ not in $U$, thus $g|_U \in C^m(U)$, as desired.)

DEFINITION 2.1: Let $U$ be a subset of $\mathbb{R}^r$, let $S$ be a collection of functions $f :
U \to \mathbb{R}$ and let $\rho$ be a metric on $S$. For any $g$ in $\Sigma(G)$ (recall $g : \mathbb{R}^r \to \mathbb{R}$) define
the restriction of $g$ to $U$, $g|_U$, as $g|_U(x) = g(x)$ for $x$ in $U$, $g|_U(x)$ unspecified for $x$
not in $U$.

Suppose that for any $f$ in $S$ and $\varepsilon > 0$ there exists $g$ in $\Sigma(G)$ such that
$\rho(f, g|_U) < \varepsilon$. Then we say that $\Sigma(G)$ contains a subset $\rho$-dense in $S$. If in addition
$g|_U$ belongs to $S$ for every $g$ in $\Sigma(G)$, we say that $\Sigma(G)$ is $\rho$-dense in $S$. □

DEFINITION 2.2: Let $m, l \in \{0\} \cup \mathbb{N}$, $0 \le m \le l$, and $U \subset \mathbb{R}^r$ be given, and let
$S \subset C^l(U)$. Suppose that for any $f$ in $S$, compact $K \subset U$ and $\varepsilon > 0$ there exists $g$ in
$\Sigma(G)$ such that $\max_{|\alpha| \le m} \sup_{x \in K} |D^\alpha f(x) - D^\alpha g(x)| < \varepsilon$. Then we say that
$\Sigma(G)$ is $m$-uniformly dense on compacta in $S$. □

When $\Sigma(G)$ is $m$-uniformly dense on compacta in $S$, then no matter how we choose
an $f$ in $S$, a compact subset $K$ of $U$, or the accuracy of approximation $\varepsilon > 0$, we can
always find a single hidden layer feedforward network having output function $g$ (in
$\Sigma(G)$) with all derivatives of $g|_U$ on $K$ up to order $m$ lying within $\varepsilon$ of those of $f$ on $K$.
This is a strong and very desirable approximation property. 
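The quantity that Definition 2.2 bounds can be estimated numerically. The following is a minimal sketch for the one-dimensional case ($r = 1$), under stated assumptions: the function name `uniform_derivative_gap`, the grid-based stand-in for the supremum over $K = [a, b]$, and the example pair (sine and its cubic Taylor polynomial, rather than a network output) are all illustrative and not taken from the text.

```python
import math

def uniform_derivative_gap(f_derivs, g_derivs, a, b, num_points=1001):
    """Approximate max_{k<=m} sup_{x in [a,b]} |D^k f(x) - D^k g(x)| on a grid.

    f_derivs and g_derivs are lists [D^0, D^1, ..., D^m] of callables.
    The grid maximum only approximates the supremum over K = [a, b].
    """
    if len(f_derivs) != len(g_derivs):
        raise ValueError("need the same number of derivative orders")
    worst = 0.0
    for k in range(num_points):
        x = a + (b - a) * k / (num_points - 1)
        for df, dg in zip(f_derivs, g_derivs):
            worst = max(worst, abs(df(x) - dg(x)))
    return worst

# Illustrative pair (an assumption): f = sin and g = x - x^3/6, with m = 1.
f_derivs = [math.sin, math.cos]
g_derivs = [lambda x: x - x ** 3 / 6.0, lambda x: 1.0 - x ** 2 / 2.0]
gap = uniform_derivative_gap(f_derivs, g_derivs, 0.0, 1.0)
```

In this example the first-derivative error dominates, so a small error in the function values alone would not certify $m$-uniform closeness for $m = 1$.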
The space $L_p(U, \mu)$ is the collection of all measurable functions $f$ such that
$\|f\|_{p,U,\mu} \equiv \left[ \int_U |f|^p \, d\mu \right]^{1/p} < \infty$, $1 \le p < \infty$, where the integral is defined in the
sense of Lebesgue. When $\mu = \lambda$ we may write either $\int_U f \, d\lambda$ or $\int_U f(x) \, dx$ to denote
the same integral. We measure the distance between two functions $f$ and $g$ belonging to
$L_p(U, \mu)$ in terms of the metric $\rho_{p,U,\mu}(f, g) \equiv \|f - g\|_{p,U,\mu}$. Two functions that differ
only on sets of $\mu$-measure zero have $\rho_{p,U,\mu}(f, g) = 0$. We shall not distinguish between
such functions.

The first Sobolev space we consider is denoted $S_p^m(U, \mu)$, defined as the collection
of all functions $f$ in $C^m(U)$ such that $\|D^\alpha f\|_{p,U,\mu} < \infty$ for all $|\alpha| \le m$. We define
the Sobolev norm $\|f\|_{m,p,U,\mu} \equiv \left( \sum_{|\alpha| \le m} \|D^\alpha f\|_{p,U,\mu}^p \right)^{1/p}$. The Sobolev metric is

$$\rho_{p,\mu}^m(f, g) \equiv \|f - g\|_{m,p,U,\mu}, \qquad f, g \in S_p^m(U, \mu).$$

Note that $\rho_{p,\mu}^m$ depends implicitly on $U$, but we suppress this dependence for notational
convenience. The Sobolev metric explicitly takes into account distances between
derivatives. Two functions in $S_p^m(U, \mu)$ are close in the Sobolev metric $\rho_{p,\mu}^m$ when all
derivatives of order $0 \le |\alpha| \le m$ are close in $L_p$ metric.
We also consider the Sobolev spaces

$$W_p^m(U) \equiv \{ f \in L_{1,\mathrm{loc}}(U) \mid \partial^\alpha f \in L_p(U, \lambda),\ 0 \le |\alpha| \le m \}.$$

This is the collection of all functions having generalized derivatives belonging to
$L_p(U, \lambda)$ of order up to $m$. Consequently, $W_p^m(U)$ includes $S_p^m(U, \lambda)$, as well as
functions that do not have derivatives in the classical sense, such as piecewise
differentiable functions.

The norm on $W_p^m(U)$ generalizes that on $S_p^m(U, \lambda)$; we write it as

$$\|f\|_{m,p,U} \equiv \left( \sum_{|\alpha| \le m} \|\partial^\alpha f\|_{p,U,\lambda}^p \right)^{1/p}, \qquad f \in W_p^m(U).$$

For the metric on $W_p^m(U)$ we suppress the dependence on $U$ and write

$$\rho_p^m(f, g) \equiv \|f - g\|_{m,p,U}, \qquad f, g \in W_p^m(U).$$

Two functions are close in the Sobolev space $W_p^m(U)$ if all generalized derivatives are
close in $L_p(U, \lambda)$ distance.
Our results make fundamental use of one last function space, the space $C_\downarrow^\infty(\mathbb{R}^r)$
of rapidly decreasing functions in $C^\infty(\mathbb{R}^r)$. $C_\downarrow^\infty(\mathbb{R}^r)$ is defined as the set of all
functions in $C^\infty(\mathbb{R}^r)$ such that for all multi-indices $\alpha$ and $\beta$, $x^\beta D^\alpha f(x) \to 0$ as
$|x| \to \infty$, where $x^\beta \equiv x_1^{\beta_1} \cdots x_r^{\beta_r}$ and $|x| \equiv \max_{1 \le i \le r} |x_i|$. Note that

$$C_0^\infty(\mathbb{R}^r) \subset C_\downarrow^\infty(\mathbb{R}^r).$$

Desired results:

1.) $\Sigma(G)$ is $m$-uniformly dense on compacta in $C_\downarrow^\infty(\mathbb{R}^r)$ and $S_p^m(U, \lambda)$

2.) $\Sigma(G)$ is $\rho_{p,\mu}^m$-dense in $S_p^m(\mathbb{R}^r, \mu)$

3.) $\Sigma(G)$ is $\rho_p^m$-dense in $W_p^m(U)$
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2. APPROXIMATION RESULTS 


THEOREM 3.1: Let $G \ne 0$ belong to $S_1^m(\mathbb{R}, \lambda)$ for some integer $m \ge 0$. Then $\Sigma(G)$
is $m$-uniformly dense on compacta in $C_\downarrow^\infty(\mathbb{R}^r)$. □

DEFINITION 3.2: Let $l \in \{0\} \cup \mathbb{N}$ be given. $G$ is $l$-finite if $G \in C^l(\mathbb{R})$ and
$0 < \int |D^l G| \, d\lambda < \infty$. □

LEMMA 3.3: If $G$ is $l$-finite then for all $0 \le m \le l$ there exists $H \in S_1^m(\mathbb{R}, \lambda)$, $H \ne 0$,
such that $\Sigma(H) \subset \Sigma(G)$. □

$l$-finite activation functions $G$ with $\int D^l G \, d\lambda \ne 0$ have $\int |D^m G| \, d\lambda = \infty$ for all $m < l$,
and for $m > l$ all $l$-finite activation functions $G$ have $\int D^m G \, d\lambda = 0$ (provided $D^m G$
exists).

It is informative to examine cases not satisfying the conditions of the theorems. For
example, if $G = \sin$ then $G \in C^\infty(\mathbb{R})$, but for all $l$, $\int |D^l G| \, d\lambda = \infty$. If $G$ is a
polynomial of degree $m$ then again $G \in C^\infty(\mathbb{R})$, but for $l \le m$ we have
$\int |D^l G| \, d\lambda = \infty$, although $\int |D^l G| \, d\lambda = 0$ for $l > m$. Consequently, neither
trigonometric functions nor polynomials are $l$-finite.
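A quick numerical check can make the $l$-finiteness condition concrete. The sketch below is illustrative only: it assumes the logistic function as the example activation, truncates the real line to a finite window, and uses a hand-written trapezoidal rule (`abs_integral` is a hypothetical helper name). For the logistic function the first derivative integrates to 1 in absolute value, whereas for $G = \sin$ the truncated integral of $|DG|$ keeps growing as the window widens.

```python
import math

def logistic(u):
    return 1.0 / (1.0 + math.exp(-u))

def logistic_deriv(u):
    # D G(u) = G(u) * (1 - G(u)) for the logistic function.
    g = logistic(u)
    return g * (1.0 - g)

def abs_integral(func, lo, hi, steps=100000):
    """Trapezoidal estimate of the integral of |func| over [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.5 * (abs(func(lo)) + abs(func(hi)))
    for k in range(1, steps):
        total += abs(func(lo + k * h))
    return total * h

# Logistic: the window [-40, 40] already captures essentially all the mass.
logistic_mass = abs_integral(logistic_deriv, -40.0, 40.0)

# Sine: D sin = cos, and the integral of |cos| grows linearly with the window.
cos_mass_short = abs_integral(math.cos, -10.0 * math.pi, 10.0 * math.pi)
cos_mass_long = abs_integral(math.cos, -20.0 * math.pi, 20.0 * math.pi)
```

The truncation window and step count are arbitrary choices; a finite window can only suggest, not prove, divergence for the sine case.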
COROLLARY 3.4: If $G$ is $l$-finite, then for all $0 \le m \le l$, $\Sigma(G)$ is $m$-uniformly dense
on compacta in $C_\downarrow^\infty(\mathbb{R}^r)$. □

COROLLARY 3.5: If $G$ is $l$-finite, $0 \le m \le l$, and $U$ is an open subset of $\mathbb{R}^r$ then
$\Sigma(G)$ is $m$-uniformly dense on compacta in $S_p^m(U, \lambda)$ for $1 \le p < \infty$. □

COROLLARY 3.6: If $G$ is $l$-finite and $\mu$ is compactly supported, then for all $0 \le m \le l$,
$\Sigma(G) \subset S_p^m(\mathbb{R}^r, \mu)$ and $\Sigma(G)$ is $\rho_{p,\mu}^m$-dense in $S_p^m(\mathbb{R}^r, \mu)$.

COROLLARY 3.8: If $G$ is $l$-finite, $0 \le m \le l$, $U$ is an open bounded subset of $\mathbb{R}^r$ and
$C_0^\infty(\mathbb{R}^r)$ is $\rho_p^m$-dense in $W_p^m(U)$ then $\Sigma(G)$ is also $\rho_p^m$-dense in $W_p^m(U)$.

These results rigorously establish that sufficiently complex multilayer feedforward
networks with as few as a single hidden layer are capable of arbitrarily accurate
approximation to an unknown mapping and its (generalized) derivatives in a variety of
precise senses. The conditions imposed on $G$ are relatively mild; the conditions required
of $U$ have practical implications.





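[Figure 1. Feedforward Network. Inputs $x_1$, $x_2$; input to hidden layer weights $\gamma_{11}$, $\gamma_{21}$, $\gamma_{12}$, $\gamma_{22}$. Legend: input unit, multiplication unit, activation unit $G(\cdot)$, addition unit. Note: biases not shown.]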
[Figure 2. Derivative Network. Legend: input unit, multiplication unit, activation unit $G(\cdot)$, addition unit, activation derivative unit $DG(\cdot)$. Note: biases not shown.]
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ABSTRACT 


Recently, multiple input, single output, single hidden layer, feedforward 
neural networks have been shown to be capable of approximating a nonlinear map 
and its partial derivatives. Specifically, neural nets have been shown to be dense 
in various Sobolev spaces (Hornik, Stinchcombe and White, 1989). Building
upon this result, we show that a net can be trained so that the map and its 
derivatives are learned. Specifically, we use a result of Gallant (1987b) to show 
that least squares and similar estimates are strongly consistent in Sobolev norm 
provided the number of hidden units and the size of the training set increase 
together. We illustrate these results by an application to the inverse problem of 
chaotic dynamics: recovery of a nonlinear map from a time series of iterates. 
These results extend automatically to nets that embed the single hidden layer, 
feedforward network as a special case. 
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3. LEARNING RESULTS 


SETUP. We consider a single hidden layer feedforward network having network
output function

$$g_K(x, \theta) = \sum_{j=1}^{K} \beta_j G(x^T \gamma_j)$$

where $x$ represents an $r \times 1$ vector of network inputs (including a "bias unit"), $\beta_j$
represents hidden to output layer weights, $\gamma_j$ represents input to hidden layer
weights, $K$ is the number of hidden units,

$$\theta' = (\beta_1, \gamma_1', \beta_2, \gamma_2', \ldots, \beta_K, \gamma_K'),$$

and $G$ is the hidden unit activation function.

We assume that the network is trained using data $\{y_t, x_t\}$ generated
according to

$$y_t = g^*(x_t) + e_t, \qquad t = 1, 2, \ldots, n.$$

$x_t$ denotes the observed input and $e_t$ denotes random noise. The number $K_n$ of
hidden units employed depends on the size $n$ of the training set. The network is
trained by finding $\hat{g}_{K_n}(x, \hat{\theta})$ that minimizes

$$s_n(\theta) = \frac{1}{n} \sum_{t=1}^{n} \Big[ y_t - \sum_{j=1}^{K_n} \beta_j G(x_t^T \gamma_j) \Big]^2$$

subject to the restriction that $\hat{g}_{K_n}(x, \hat{\theta})$ is a member of the estimation space $\mathcal{G}$.
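The network output function and the least squares criterion above can be written down directly. The following is a minimal sketch, assuming a logistic activation and plain Python lists for the weights; the function names are hypothetical, and the constraint that the fitted function lie in the estimation space is deliberately not implemented here.

```python
import math

def logistic(u):
    return 1.0 / (1.0 + math.exp(-u))

def network_output(x, betas, gammas, activation=logistic):
    """g_K(x, theta) = sum_j beta_j * G(x' gamma_j) for one input vector x.

    x should already contain the bias component (for example a trailing 1.0).
    """
    total = 0.0
    for beta_j, gamma_j in zip(betas, gammas):
        pre = sum(x_i * g_i for x_i, g_i in zip(x, gamma_j))
        total += beta_j * activation(pre)
    return total

def least_squares_objective(xs, ys, betas, gammas):
    """s_n(theta) = (1/n) * sum_t [y_t - g_K(x_t, theta)]^2 (unconstrained)."""
    n = len(xs)
    return sum((y - network_output(x, betas, gammas)) ** 2
               for x, y in zip(xs, ys)) / n

# Toy check with K = 1 and zero input weights: G(0) = 0.5, so g_K = 2 * 0.5 = 1.
demo_value = network_output([0.3, 1.0], [2.0], [[0.0, 0.0]])
demo_loss = least_squares_objective([[0.3, 1.0], [0.7, 1.0]], [1.0, 3.0],
                                    [2.0], [[0.0, 0.0]])
```

Any optimizer can be applied to `least_squares_objective`; the choice of optimizer and the enforcement of the norm bound are left open in this sketch.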
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REGULARITY CONDITIONS: 


Input space. The input space $\mathcal{X}$ is the closure of a bounded, open subset of $\mathbb{R}^r$.

Parameter space. For some integer $m$, $0 \le m < \infty$, some integer $p$, $1 \le p < \infty$, and
some bound $B$, $0 < B < \infty$, $g^*$ is a point in the Sobolev space $\mathcal{W}_{m+[r/p]+1,p,\mathcal{X}}$ and
$\|g^*\|_{m+[r/p]+1,p,\mathcal{X}} \le B$.

Activation function. The activation function $G$ belongs to $C^m(\mathbb{R})$ and
$\int_{-\infty}^{\infty} (d^m/du^m) G(u) \, du < \infty$. See Section 3 of Hornik, Stinchcombe and White
(1989).

Estimation space. $\hat{g}_{K_n}(x, \theta)$ is restricted to $\mathcal{G} = \{ g : \|g\|_{m+[r/p]+1,p,\mathcal{X}} \le B \}$ in
the optimization of $s_n(g)$.

Training set. The empirical distribution of $\{x_t\}_{t=1}^{n}$ converges to a distribution
$\mu(x)$ and $\mu(O) > 0$ for every open subset $O$ of $\mathcal{X}$.

Error process. The errors $\{e_t\}$ are independently and identically distributed with
common probability law $P$ having $\int e \, P(de) = 0$ and $0 \le \int e^2 \, P(de) < \infty$.
($\int e^2 \, P(de) = 0$ implies $e_t = 0$ for all $t$.)
Independence. The probability law $P$ of the errors does not depend on $\{x_t\}_{t=1}^{\infty}$;
that is, $P(A)$ can be evaluated without knowledge of
$\lim_{n \to \infty} (1/n) \sum_{t=1}^{n} x_t$, etc.
THEOREM 1. Under the Regularity Conditions

$$\lim_{n \to \infty} \| g^* - \hat{g}_{K_n}(\cdot, \hat{\theta}) \|_{m,\infty,\mathcal{X}} = 0 \quad \text{almost surely}$$

provided $\lim_{n \to \infty} K_n = \infty$ almost surely. In particular,

$$\lim_{n \to \infty} \sigma[\hat{g}_{K_n}(\cdot, \hat{\theta})] = \sigma(g^*) \quad \text{almost surely}$$

provided $\sigma$ is continuous with respect to $\| \cdot \|_{m,\infty,\mathcal{X}}$. □
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4. EXAMPLE: LEARNING CHAOTIC MAP 


Our investigation studies the ability of the single hidden layer network

$$g_K(x_{t-5}, \ldots, x_{t-1}) = \sum_{j=1}^{K} \beta_j G(\gamma_{5j} x_{t-5} + \cdots + \gamma_{1j} x_{t-1} + \gamma_{0j})$$

with logistic squasher

$$G(u) = 1/[1 + \exp(-u)]$$

to approximate the derivatives of a discretized variant of the Mackey-Glass
equation (Schuster, 1988, p. 120)

$$g(x_{t-5}, x_{t-1}) = x_{t-1} + (10.5) \left[ \frac{(0.2)\, x_{t-5}}{1 + (x_{t-5})^{10}} - (0.1)\, x_{t-1} \right].$$




The values of the weights $\beta_j$ and $\gamma_{ij}$ that minimize

$$s_n(g_K) = \frac{1}{n} \sum_{t=1}^{n} [x_t - g_K(x_{t-5}, \ldots, x_{t-1})]^2$$

were determined using the Gauss-Newton nonlinear least squares algorithm. Our
rule relating $K$ to $n$ was of the form $K \sim \log(n)$ because asymptotic theory in a
related context (Gallant, 1989) suggests that this is likely to be the relationship
that will give stable estimates.
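To make the data-generating step concrete, the sketch below iterates a discretized Mackey-Glass map and builds lagged training pairs. It rests on loudly stated assumptions: the printed equation is damaged in this copy, so the decay term $-(0.1)x_{t-1}$ and the exact placement of the constants follow the standard Mackey-Glass form and should be checked against the original; the starting values and function names are hypothetical.

```python
def mackey_glass_step(x_lag5, x_lag1):
    """One iterate of the assumed map
    x_t = x_{t-1} + 10.5 * (0.2 * x_{t-5} / (1 + x_{t-5}**10) - 0.1 * x_{t-1}).
    The -0.1 * x_{t-1} term is an assumption (standard Mackey-Glass decay).
    """
    return x_lag1 + 10.5 * (0.2 * x_lag5 / (1.0 + x_lag5 ** 10) - 0.1 * x_lag1)

def iterate_map(initial, n):
    """Extend a list of five starting values by n further iterates."""
    series = list(initial)
    for _ in range(n):
        series.append(mackey_glass_step(series[-5], series[-1]))
    return series

def training_pairs(series):
    """Return ((x_{t-5}, ..., x_{t-1}), x_t) pairs for least squares fitting."""
    return [(tuple(series[t - 5:t]), series[t]) for t in range(5, len(series))]

# Arbitrary starting values, chosen only for illustration.
series = iterate_map([0.5] * 5, 200)
pairs = training_pairs(series)
```

Each pair supplies the five lagged inputs of $g_K$ and the target $x_t$ used in $s_n(g_K)$.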
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Figure 1 . Superimposed nonlinear map and neural net estimate 
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. Superimposed derivative and neural net estimate 
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and Hardware Implementation 
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Heng Mui Keng Terrace, Kent Ridge 
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Abstract 

The conventional compositional inference algorithm of a fuzzy controller consumes a
great deal of time and memory. As a result, real-time fuzzy inference is difficult, and most
fuzzy controllers are realized by look-up tables. In this paper we derive a simplified
algorithm using mean-of-maximum defuzzification. This algorithm takes less
computation time and needs less memory, making it possible to compute the
fuzzy inference in real time and to tune the control rules on line. The responsibility
of this algorithm is proved mathematically in this paper.

Fuzzy controllers have been highly developed and have reached a new stage of hardware
implementation. Many hardware fuzzy controllers (or so-called fuzzy inference machines)
are available on the market. The conventional fuzzy inference algorithm
on which most fuzzy controllers are based is too complicated. Further, its hardware
implementation is expensive and bulky, and the inference speed is
limited. Reducing cost and volume and improving inference speed are very
important to this technology. In this paper we also describe a hardware implementation
based on the above simplified fuzzy inference algorithm.


1. Fuzzy controller algorithm 

Assume that the fuzzy controller has two inputs and a single output, as shown
in Figure 1,

[Fig. 1. The block graph of fuzzy controller: inputs A and B, fuzzy relation R, output C]
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where A and B are the linguistic variables of the inputs, with universe of 
discourse X and Y respectively, and C is the linguistic variable of the output, 
with universe of discourse U. We emphasize here that X and Y are not 
necessarily continuous on the real line R, but arbitrary subsets of R. 

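Let the sets of linguistic values associated with A, B and C respectively be as
follows:

$$A_i \in \mathcal{F}(X), \quad (i \in I) \tag{1}$$

$$B_j \in \mathcal{F}(Y), \quad (j \in J) \tag{2}$$

$$C_k \in \mathcal{F}(U), \quad (k \in K) \tag{3}$$

where $I = \{1, 2, \ldots, m\}$, $J = \{1, 2, \ldots, n\}$, $K = \{1, 2, \ldots, h\}$, and $\mathcal{F}(X)$ represents the fuzzy
power set of $X$.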

The fuzzy control rules are described in terms of a group of multi-complexed
fuzzy implications as follows:

If A is $A_i$ and B is $B_j$ then C is $C_k$,
$(i \in I,\ j \in J,\ k = \varphi(i, j) \in K)$    (4)


The above fuzzy implications can be translated into a three-dimensional
relation $R$ according to the fuzzy Compositional Rule of Inference (CRI method).

Definition 1.

$$R = \bigcup_{i,j} (A_i \times B_j \times C_k), \qquad R \in \mathcal{F}(X \times Y \times U), \tag{5}$$

$$R(x, y, u) = \bigvee_{i,j} \big( A_i(x) \wedge B_j(y) \wedge C_k(u) \big), \qquad (k = \varphi(i, j) \in K)$$


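Suppose that the inputs of the fuzzy controller at a certain instant are the fuzzy
sets $A^* \in \mathcal{F}(X)$ and $B^* \in \mathcal{F}(Y)$. According to the CRI method, the output of the
controller will be the fuzzy set denoted by $C^* \in \mathcal{F}(U)$, i.e.

$$C^* = (A^* \times B^*) \circ R$$

$$C^*(u) = \sup_{x \in X,\, y \in Y} \big( A^*(x) \wedge B^*(y) \wedge R(x, y, u) \big)$$

$$= \sup_{x \in X,\, y \in Y} \Big( \big( A^*(x) \wedge B^*(y) \big) \wedge \bigvee_{i,j} \big( A_i(x) \wedge B_j(y) \wedge C_{\varphi(i,j)}(u) \big) \Big)$$

$$= \sup_{x \in X,\, y \in Y} \bigvee_{i,j} \Big( \big( A^*(x) \wedge A_i(x) \big) \wedge \big( B^*(y) \wedge B_j(y) \big) \wedge C_{\varphi(i,j)}(u) \Big)$$

$$= \bigvee_{i,j} \Big( \sup_{x \in X} \big( A^*(x) \wedge A_i(x) \big) \wedge \sup_{y \in Y} \big( B^*(y) \wedge B_j(y) \big) \wedge C_{\varphi(i,j)}(u) \Big) \tag{6}$$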

In actual applications the inputs of the controller (i.e. the observed values of
the controlled process) are definite real numbers. Suppose that at a certain
instant the observed value is a pair $(x_0, y_0)$; then the fuzzy sets of inputs $A^*$
and $B^*$ are as follows:

$$A^*(x) = \begin{cases} 1, & x = x_0 \\ 0, & x \ne x_0 \end{cases} \qquad B^*(y) = \begin{cases} 1, & y = y_0 \\ 0, & y \ne y_0 \end{cases} \tag{7}$$

so that

$$\sup_{x \in X} \big( A^*(x) \wedge A_i(x) \big) = A_i(x_0) \tag{8}$$

$$\sup_{y \in Y} \big( B^*(y) \wedge B_j(y) \big) = B_j(y_0) \tag{9}$$

therefore

$$C^*(u) = \bigvee_{i,j} \big( A_i(x_0) \wedge B_j(y_0) \big) \wedge C_{\varphi(i,j)}(u) \tag{10}$$

$(i \in I,\ j \in J,\ \varphi(i, j) \in K)$
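Formula (10) can be evaluated term by term on a discretized output universe, which is essentially what the conventional algorithm does. The following is a minimal sketch under stated assumptions: triangular membership functions, 0-based indices, the rule map `phi` and the three-point output grid are all hypothetical choices made for illustration, not taken from the paper.

```python
def triangle(center, half_width):
    """Triangular membership function; the shape is an illustrative assumption."""
    def mu(v):
        return max(0.0, 1.0 - abs(v - center) / half_width)
    return mu

def output_fuzzy_set(x0, y0, A, B, C, phi, u_grid):
    """Evaluate formula (10) on a grid:
    C*(u) = max over (i, j) of min(A_i(x0), B_j(y0), C_phi(i,j)(u)).
    """
    # The firing strengths do not depend on u, so compute them once.
    strengths = {(i, j): min(a(x0), b(y0))
                 for i, a in enumerate(A) for j, b in enumerate(B)}
    return [max(min(w, C[phi[ij]](u)) for ij, w in strengths.items())
            for u in u_grid]

# Hypothetical two-by-two rule base with 0-based indices and phi(i, j) = i + j.
A = [triangle(0.0, 1.0), triangle(1.0, 1.0)]
B = [triangle(0.0, 1.0), triangle(1.0, 1.0)]
C = [triangle(0.0, 1.0), triangle(1.0, 1.0), triangle(2.0, 1.0)]
phi = {(i, j): i + j for i in range(2) for j in range(2)}
c_star = output_fuzzy_set(0.25, 0.25, A, B, C, phi, [0.0, 1.0, 2.0])
```

The cost per output point grows with the number of rules, which is the burden the later simplification aims to remove.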

2. The responsibility of the fuzzy controller 
The responsibility of a fuzzy controller has been defined and analyzed in depth
by P. Z. Wang and S. P. Lou [3]. Here we discuss the responsibility of a fuzzy
controller under a weaker condition.

Definition 2. Let $A_i \in \mathcal{F}(X)$ $(i \in I)$, $I = \{1, 2, \ldots, m\}$, be the set of linguistic values
associated with A, where each $A_i$ is a normally distributed fuzzy set. Suppose there
exist $m + 1$ real numbers

$$r_0 < r_1 < r_2 < \cdots < r_m$$

such that for any given $x \in (r_{i-1}, r_i)$, if $j \ne i$, then $A_j(x) < A_i(x)$ (see
Figure 2). We call $J_i = (r_{i-1}, r_i) \subset X$ the interval of $A_i$, $i \in I$, and
$N' = \{r_i\}$ the net of A, where $J_i$ and $N'$ satisfy the following:

$$J_i \cap J_j = \emptyset, \quad (i \ne j) \tag{11}$$

$$\bigcup_{i=1}^{m} J_i = X - \{r_i\} \tag{12}$$


For $A_i \in \mathcal{F}(X)$, $B_j \in \mathcal{F}(Y)$, we have

$$(A_i \times B_j)(x, y) \equiv A_i(x) \wedge B_j(y) \tag{13}$$

$$\forall i \in I,\ \forall j \in J,\quad I = \{1, 2, \ldots, m\},\ J = \{1, 2, \ldots, n\}. \tag{14}$$

If there exist nets $N' = \{r'_i\}$, $N'' = \{r''_j\}$ and intervals $\{J'_i\}$, $\{J''_j\}$ for A
and B respectively, then

$$(x, y) \in J_{st} \ \Rightarrow\ A_s(x) \wedge B_t(y) > A_i(x) \wedge B_j(y), \qquad ((s, t) \ne (i, j)) \tag{15}$$

where $J_{ij} = J'_i \times J''_j$ is called the interval of $(A_i \times B_j)$.

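In fact,

$$(x, y) \in J_{st} \ \Rightarrow\ x \in J'_s,\ y \in J''_t \ \Rightarrow\ A_s(x) > A_i(x),\ B_t(y) > B_j(y), \qquad (s, t) \ne (i, j)$$

$$\Rightarrow\ \min(A_s(x), B_t(y)) > \min(A_i(x), B_j(y))$$

$$\Rightarrow\ A_s(x) \wedge B_t(y) > A_i(x) \wedge B_j(y) \tag{16}$$

We define the net of $A \times B$ as

$$N \equiv \{ (x, y) \mid x = r'_i,\ y = r''_j \} \tag{17}$$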

According to the definition of responsibility of a fuzzy controller given in [3], we
state the following definition with slight changes.

Definition 3. A fuzzy controller is said to be responsive if there exists an
interval $L \subset (-\infty, +\infty)$ such that

$$L = \{ u \mid C^*(u) = \operatorname{hgt}_{u \in U} C^*(u) \}, \tag{18}$$

where $\operatorname{hgt} C^*(u)$ is the height of the fuzzy set $C^*(u)$ and $L$ is the
responsive interval.

If a fuzzy controller is responsive, the output of the controller, according to
mean-of-maximum defuzzification, is

$$u_0 = M(L) \quad (u_0 \in U), \tag{19}$$

where $M(L)$ means the mid-point of $L$.
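On a discretized output universe, formulas (18) and (19) reduce to locating the grid points where the membership reaches its height and taking the mid-point of their range. The sketch below is illustrative only: it assumes the grid is sorted in ascending order and that the maximizers form a single interval (the responsive case); the function name and tolerance are hypothetical.

```python
def mean_of_maximum(u_grid, memberships, tol=1e-12):
    """Return the mid-point of L = {u : C*(u) = hgt C*} on a sorted grid.

    Assumes the maximizing grid points form one interval, as in the
    responsive case; otherwise the mid-point of their range is returned.
    """
    height = max(memberships)
    maximizers = [u for u, m in zip(u_grid, memberships) if height - m <= tol]
    return 0.5 * (maximizers[0] + maximizers[-1])

# Illustrative membership values with a plateau at height 0.6.
u0_demo = mean_of_maximum([0.0, 0.5, 1.0, 1.5, 2.0], [0.2, 0.6, 0.6, 0.6, 0.1])
```

The accuracy of the result is limited by the grid spacing, since the true end-points of $L$ generally fall between grid points.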

Theorem 1. A given fuzzy controller is responsive as long as there exists a net
$N$ of $A \times B$ such that the intersection of $N$ and the universe of
discourse $(X \times Y)$ is empty, i.e.

$$N \cap \mathcal{X} = \emptyset, \quad (\mathcal{X} = X \times Y) \tag{20}$$

Proof: Assume that $N$ is the net of $A \times B$, which satisfies formula (20), i.e.

$$\mathcal{X} = \mathcal{X} - N \tag{21}$$

From formula (21), we derive that

$$\bigcup_{i,j} J_{ij} = \mathcal{X} = (X \times Y) \tag{22}$$

so for any definite $(x_0, y_0) \in (X \times Y)$, there exist $s$, $t$ such that
$(x_0, y_0) \in J_{st}$.

From formula (16), for any $s, i \in I$, $t, j \in J$, if $(s, t) \ne (i, j)$, then

$$A_s(x_0) \wedge B_t(y_0) > A_i(x_0) \wedge B_j(y_0) \tag{23}$$

According to formula (10), the response of the fuzzy controller is as
follows:

$$C^*(u) = \bigvee_{i,j} A_i(x_0) \wedge B_j(y_0) \wedge C_{\varphi(i,j)}(u) \tag{24}$$

By formula (23), it is obvious that

$$M(C^*) = M(f) \tag{25}$$

where

$$M(C^*) = \{ u \mid C^*(u) = \operatorname{hgt} C^*(u) \} \tag{26}$$

$$M(f) = \{ u \mid f(u) = \operatorname{hgt} f(u) \} \tag{27}$$

$$f(u) \equiv A_s(x_0) \wedge B_t(y_0) \wedge C_{\varphi(s,t)}(u) \tag{28}$$

As it is known that $C_{\varphi(i,j)} \in \{C_k\}$ is a distributed fuzzy set whose
kernel is

$$\operatorname{Ker}(C_k) = \{ u \mid C_k(u) = 1 \} \ne \emptyset \tag{29}$$

obviously, there exists an interval $L \subset (-\infty, +\infty)$ such that

$$L = \{ u \mid C_{\varphi(s,t)}(u) \ge A_s(x_0) \wedge B_t(y_0) \}$$

$$= \{ u \mid f(u) = A_s(x_0) \wedge B_t(y_0) \}$$

$$= \{ u \mid C^*(u) = \operatorname{hgt} C^*(u) \} \tag{30}$$

therefore the fuzzy controller is responsive and $u_0 = M(L)$.


3. The simplification of fuzzy controller algorithm 


The right-hand side of formula (10) contains $m \times n$ terms (one for each pair
$(i, j) \in I \times J$) joined by union operations. The ordinary algorithm performs these
calculations term by term and is very time consuming. We know from Theorem 1 that
when a fuzzy controller is responsive and mean-of-maximum defuzzification is used,
we only need to calculate the interval $L$; the mid-point of $L$ is then the desired
output of the fuzzy controller. Thus for every observed value $(x_0, y_0)$, we only have to
calculate $f(u)$, a single term of formula (10). This simplifies the
computation to a great extent.


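Let

$$A_\Sigma = \bigvee_{i \in I} A_i \tag{31}$$

$$A_\Sigma(x) = \bigvee_{i \in I} A_i(x) \tag{32}$$

$$B_\Sigma = \bigvee_{j \in J} B_j \tag{33}$$

$$B_\Sigma(y) = \bigvee_{j \in J} B_j(y) \tag{34}$$

where $x \in X - \{r'_i\}$, $y \in Y - \{r''_j\}$. The membership functions $A_\Sigma(x)$ and $B_\Sigma(y)$ are
shown in Figure 3-a.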
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Figure 3-b The function of pk(P) 
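Clearly, when $(x, y) \in J_{st}$, i.e. $x \in (r'_{s-1}, r'_s)$, $y \in (r''_{t-1}, r''_t)$,

$$A_\Sigma(x) = A_s(x) > A_i(x), \tag{35}$$

$$B_\Sigma(y) = B_t(y) > B_j(y), \tag{36}$$

$$A_\Sigma(x) \wedge B_\Sigma(y) = A_s(x) \wedge B_t(y) > A_i(x) \wedge B_j(y) \tag{37}$$

$\forall s, i \in I$, $t, j \in J$, $(s, t) \ne (i, j)$.

Define the separating functions $\varphi_1(x)$, $\varphi_2(y)$ respectively as follows:

$$\varphi_1(x) = i, \quad x \in (r'_{i-1}, r'_i), \tag{38}$$

$$\varphi_2(y) = j, \quad y \in (r''_{j-1}, r''_j). \tag{39}$$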


For $C_k \in \{C_k\}$, $k \in K$, we define the following function:

$$p_k(\beta) = M(L) = M(C_{k\beta}), \quad \beta \in [0, 1] \tag{40}$$

where $L \subset (-\infty, +\infty)$,

$$L \cap U = \{ u \mid C_k(u) \ge \beta \} \tag{41}$$

where $C_{k\beta}$ is the $\beta$-cut set of $C_k$ and $M(\cdot)$ represents the mid-point of $(\cdot)$, as
shown in Figure 3-b. Since $C_k$ $(k \in K)$ is a normally distributed set, $p_k(\beta)$ is a
continuous single-valued function of $\beta$, $\forall \beta \in [0, 1]$.

Now that the functions $A_\Sigma(x)$, $B_\Sigma(y)$, $\varphi_1(x)$, $\varphi_2(y)$ and $p_k(\beta)$ are defined, we can
derive the following simplified algorithm for responsive fuzzy controllers:

1) Given the inputs $(x_0, y_0)$ of the fuzzy controller, calculate

$$a = A_\Sigma(x_0), \quad b = B_\Sigma(y_0), \tag{42}$$

$$s = \varphi_1(x_0), \quad t = \varphi_2(y_0), \tag{43}$$

$$\beta = a \wedge b = \min(a, b). \tag{44}$$

2) Calculate

$$k = \varphi(s, t), \quad k \in K \tag{45}$$

where $\varphi$ is determined by the given control rules.

3) Finally, the output of the fuzzy controller can be obtained from

$$u_0 = p_k(\beta) \tag{46}$$

Obviously,

$$L \cap U = \{ u \mid C_k(u) \ge \beta \} = \{ u \mid C_{\varphi(s,t)}(u) \ge A_s(x_0) \wedge B_t(y_0) \} \tag{47}$$

$$u_0 = p_k(\beta) = M(L)$$

It can be seen that the result is exactly the same as that in formula (30).
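The three steps of the simplified algorithm translate almost line by line into code. The following is a minimal sketch under stated assumptions: membership functions are triangles given as (left, peak, right), indices are 0-based, the nets are sorted lists, the rule map `phi` is hypothetical, the envelope is computed as a maximum over the stored $A_i$ rather than stored as one function, and clipping of the cut to the output universe is ignored.

```python
from bisect import bisect_right

def tri(left, peak, right):
    """Triangular membership function (illustrative shape, not from the paper)."""
    def mu(v):
        if v <= left or v >= right:
            return 0.0
        if v <= peak:
            return (v - left) / (peak - left)
        return (right - v) / (right - peak)
    return mu

def cut_midpoint(left, peak, right, beta):
    """p_k(beta): mid-point of the beta-cut {u : C_k(u) >= beta} of a triangle."""
    lo = left + beta * (peak - left)
    hi = right - beta * (right - peak)
    return 0.5 * (lo + hi)

def simplified_controller(x0, y0, A, B, net_a, net_b, phi, C_params):
    """Steps 1) to 3); x0 and y0 are assumed to lie off the nets."""
    a = max(mu(x0) for mu in A)          # a = A_Sigma(x0), formula (42)
    b = max(mu(y0) for mu in B)          # b = B_Sigma(y0), formula (42)
    s = bisect_right(net_a, x0) - 1      # s = phi_1(x0), formula (43)
    t = bisect_right(net_b, y0) - 1      # t = phi_2(y0), formula (43)
    beta = min(a, b)                     # formula (44)
    k = phi[(s, t)]                      # formula (45)
    return cut_midpoint(*C_params[k], beta)  # u0 = p_k(beta), formula (46)

# Hypothetical rule base on X = Y = (0, 1) with the nets at 0, 0.5 and 1.
A = [tri(-1.0, 0.0, 1.0), tri(0.0, 1.0, 2.0)]
B = [tri(-1.0, 0.0, 1.0), tri(0.0, 1.0, 2.0)]
net_a = [0.0, 0.5, 1.0]
net_b = [0.0, 0.5, 1.0]
C_params = [(-1.0, 0.0, 1.0), (0.0, 1.0, 3.0), (1.0, 2.0, 3.0)]
phi = {(s, t): s + t for s in range(2) for t in range(2)}
u0 = simplified_controller(0.2, 0.7, A, B, net_a, net_b, phi, C_params)
```

Here $\beta = \min(0.8, 0.7) = 0.7$, rule $(s, t) = (0, 1)$ selects the asymmetric output set, and the mid-point of its 0.7-cut $[0.7, 1.6]$ is returned; no loop over the rule base or the output universe is needed.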

The conventional fuzzy controller algorithm is very time consuming and needs a
large memory space, so that it is hardly possible to implement fuzzy
compositional inference on line in a control system. In many applications, fuzzy
controllers use look-up tables instead of real-time inference. Not only is it then
impossible to tune the fuzzy control rules on line, but it also takes a great amount of
computation time to calculate the fuzzy controller look-up table. The simplified
algorithm proposed above reduces the computation greatly, and its calculating
time is nearly the same as that taken by the conventional PID control
algorithm. This makes it possible to do real-time fuzzy inference in the
controller, allowing the control rules to be tuned on line. If the algorithm is used
to calculate the fuzzy control look-up table, it takes less than one minute. Memory use
is also reduced, since we only need to store 5 functions, namely $A_\Sigma(x)$, $B_\Sigma(y)$, $\varphi_1(x)$,
$\varphi_2(y)$ and $p_k(\beta)$, instead of all the $A_i(x)$, $B_j(y)$ and $C_k(u)$, a total of $m + n + h$ functions.

4. Hardware Implementation 

Fuzzy controllers have been highly developed and have reached a new stage of
hardware implementation. Many hardware fuzzy controllers (or so-called fuzzy inference
machines) are available on the market [4][5]. The conventional
fuzzy inference algorithm on which most fuzzy controllers are based is too
complicated. Further, its hardware implementation is expensive and
bulky, and the inference speed is limited. Reducing cost and volume
and improving inference speed are very important to this technology.

As can be seen from the last section, the calculation with the proposed algorithm
is much simpler: there is no computation on fuzzy sets, and most of
the calculations involve only function evaluations and comparisons.
Therefore, this fuzzy control algorithm is very easy to implement in hardware.
The main issue in a hardware design is to construct function generators
for $A_\Sigma(x)$, $B_\Sigma(y)$, $\varphi_1(x)$, $\varphi_2(y)$ and $p_k(\beta)$, while the complicated fuzzy set
operations, which are difficult to turn into hardware counterparts, are avoided.
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Figure 4 Block diagram of the fuzzy controller board 


We have designed a fuzzy controller board for personal computers (PCs) based
on the above algorithm. The principle of the fuzzy controller board is
illustrated in Figure 4. The board is composed of function generators for
$A_\Sigma(x)$, $B_\Sigma(y)$, $\varphi_1(x)$, $\varphi_2(y)$ and $p_k(\beta)$, a comparator that performs the operation
min(a, b), and a rule base that stores the control rules. Each part is constructed
with digital ICs. The detailed design of the hardware will be presented
in our future papers.

The controller board is connected to the CPU through the data bus of the PC.
The generators of $A_\Sigma(x)$, $B_\Sigma(y)$, $\varphi_1(x)$, $\varphi_2(y)$ and $p_k(\beta)$ and the control rules can
be programmed conveniently. Using this board with its software environment
on a personal computer, it is easy to construct a fuzzy control system
for an industrial process in which a large amount of data needs to be processed.
This is why we designed a fuzzy controller board instead of an
independent fuzzy controller machine, which is unable to process data and
information.

Owing to its fuzzy inference function and its data processing ability, the fuzzy
control system can be applied not only to control systems but also to many
other areas, such as expert systems, pattern recognition and decision making,
where the fuzzy inference method may be employed.
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Radar Signal Categorization Using a Neural Network 


Abstract 

Neural networks were used to analyze a complex simulated radar 
environment which contains noisy radar pulses generated by many 
different emitters. The neural network used is an energy 
minimizing network (the BSB model) which forms energy minima — 
attractors in the network dynamical system — based on learned 
input data. The system first determines how many emitters are 
present (the deinterleaving problem). Pulses from individual 
simulated emitters give rise to separate stable attractors in the 
network. Once individual emitters are characterized, it is 
possible to make tentative identifications of them based on their 
observed parameters. As a test of this idea, a neural network 
was used to form a small data base that potentially could make 
emitter identifications. 


We have used neural networks to cluster, characterize and identify radar signals 
from different emitters. The approach assumes the ability to monitor a region of the 
microwave spectrum and to detect and measure properties of received radar pulses. 
The microwave environment is assumed to be complex, so there are pulses from a number 
of different emitters present, and pulses from the same emitter are noisy or their 
properties are not measured with great accuracy. 

For several practical applications, it is important to be able to tell quickly,
first, how many emitters are present and, second, what their properties are. In 
other words time average prototypes must be derived from time dependent data without 
a tutor. Finally the system must tentatively identify the prototypes as members of 
previously seen classes of emitter. 


Stages of Processing. We accomplish this task in several stages. Figure 1 
shows a block diagram of the resulting system, which contains several neural 
networks. The system as a whole is referred to as the Adaptive Network Sensor 
Processor (ANSP). 


Figure 1 About Here 


In the block diagram given in Figure 1, the first block is a feature extractor. 
We start by assuming a microwave radar receiver of some sophistication at the input 
to the system. This receiver is capable of processing each pulse into feature 
values, i.e. azimuth, elevation, signal to noise ratio (normalized intensity), 
frequency, and pulse width. This data is then listed in a pulse buffer and tagged 
with time of arrival of the pulse. In a complex radar environment, hundreds or 
thousands of pulses can arrive in fractions of seconds, so there is no lack of data. 
The problem, as in many data rich environments, is making sense of it. 
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The second block in Figure 1 is the deinterleaver which clusters incoming radar 
pulses into groups, each group formed by pulses from a single emitter. A number of 
pulses are observed, and a neural network computes, off line, how many emitters are 
present, based on the sample, and estimates their properties. That is, it solves the 
so-called deinterleaving problem by identifying pulses as being produced by a 
particular emitter. This block also produces and passes forward measures of each
cluster's azimuth, elevation, SNR, frequency and pulse width.

The third block, the pulse pattern extractor, uses the deinterleaved information
to compute the pulse repetition pattern of an emitter by using the times of arrival 
for the pulses that are contained in a given cluster. This information will be used 
for emitter classification. 

The fourth block, the tracker, acts as a long term memory for the clusters found 
in the second block, storing the average azimuth, elevation, SNR, frequency, and 
pulse width. Since the diagram in Figure 1 is organized via initial computational 
functionality, the tracking module follows the deinterleaver so as to store its 
outputs. In an operationally organized diagram, the tracker is the first block to 
receive pulse data from the feature extractor. It must identify most of the pulses 
in real time as previously learned by the deinterleaver module and only pass a small 
number of unknown pulses back to the deinterleaver module for further learning. The 
tracker also updates the cluster averages. Their properties can change with time 
because of emitter or receiver motion, for example. 

The fourth and fifth blocks, the tracker and the classifier, operate as a unit to
classify the observed emitters, based on information stored in a data base of emitter 
types. Intrinsic emitter properties stored in these blocks are frequency, pulse 
width and pulse repetition pattern. 

The most important question for the ANSP to answer is what the emitters might be
and what they can do. That is, "who is looking at me, should I be concerned, and
should I (or can I) do something about it?"


Emitter Clustering. Most of the initial theoretical and simulation effort in 
this project has been focused on the deinterleaving problem. This is because the 
ANSP is being asked to form a conception of the emitter environment from the data 
itself. A teacher does not exist for most interesting situations. 

In the simplest case, each emitter emits with constant properties, i.e. no 
noise is present. Then, determining how many emitters were present would be trivial: 
simply count the number of unique pulses via a look up table. Unfortunately, data is 
often moderately noisy because of receiver, environmental and emitter variability, 
and, sometimes, because of the frequent change of one or another emitter property at 
the emitter. Therefore, simple identity checks will not work. It is these latter
cases which this paper will address.

Many neural networks are supervised algorithms, that is, they are trained by 
seeing correctly classified examples of training data and, when new data is presented,
will identify it according to their past experience. Emitter identification does not
fall into this category because the correct answers are not known ahead of time. 
That, after all, is the purpose of this system. The basic problem of a 
self-organizing clustering system has many historical precedents in cognitive 
science. For example, William James, in a quotation well known to developmental 
psychologists, wrote around 1890, 





..the numerous inpouring currents of the baby bring to his 
consciousness . . . one big blooming buzzing Confusion. That 
Confusion is the baby's universe; and the universe of all of us 
is still to a great extent such a Confusion, potentially 
resolvable, and demanding to be resolved, but not yet actually 
resolved into parts. 


William James (1890, p.29) 


We now know that the newborn baby is a very competent organism, and the
outlines of adult perceptual preprocessing are already in place. The baby is 
designed to hear human speech in the appropriate way and to see a world like ours: 
that is, a baby is tuned to the environment in which he will live. The same is true 
of the ANSP, which must process pulses which will have feature values that fall 
within certain parameter ranges. That is, an effective feature analysis has been 
done for us by the receiver designer, and we do not have to organize a system from 
zero. This means that we can use a less general approach than we might have to in a 
less constrained problem. The result of both evolution and good engineering design 
is to build so much structure into the system that a problem, very difficult in its 
general form, becomes quite tractable. 

At this point, neural networks are familiar to many. Introductions are 
available, for example, McClelland and Rumelhart, 1986; Rumelhart and McClelland, 
1986; Hinton and Anderson, 1989; Anderson and Rosenfeld, 1988. 

The Linear Associator. Let us begin our discussion of the network we shall use
for the radar problem with the 'outer product' associator, also called the 'linear
associator,' as a starting point (Kohonen, 1972, 1977, 1984; Anderson, 1972). We
assume a single computing unit, a simple model neuron, acts as a linear summer of its
inputs. There are many such computing units. The set of activities of a group of
units is the system state vector. Our notation has matrices represented by capital
letters (A), vectors by lower case letters (f, g), and the elements of vectors as f(i)
or g(j). A vector from a set of vectors is subscripted, for example, $f_1, f_2, \ldots$

The ith unit in a set of units will display activity g(i) when a pattern f is
presented to its inputs, according to the rule,

$$g(i) = \sum_j A(i,j)\, f(j),$$

where A(i,j) are the connections between the ith unit in an output set of units and
the jth unit in an input set. We can then write the output pattern, g, as the
matrix multiplication

$$g = A f.$$


During learning, the connection strengths are modified according to a 
generalized Hebb rule, that is, the change in an element of A, $\Delta A(i,j)$, is given by 

    $$ \Delta A(i,j) \propto f_k(j)\, g_k(i), $$

where $f_k$ and $g_k$ are the vectors associated with the kth learning example. 
Then we can write the matrix A as a sum of outer products, 

    $$ A = \eta \sum_{k=1}^{n} g_k f_k^T, $$

where $\eta$ is a learning constant. 
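
The outer product rule can be illustrated with a minimal sketch in plain Python. The function names and the toy vectors are illustrative assumptions, not part of the original work. Two orthonormal input patterns are stored, and each association is then recalled exactly because the inputs do not interfere with one another.

```python
def hebb_learn(pairs, eta, n_out, n_in):
    """Build A = eta * sum_k g_k f_k^T from (f_k, g_k) training pairs."""
    A = [[0.0] * n_in for _ in range(n_out)]
    for f, g in pairs:
        for i in range(n_out):
            for j in range(n_in):
                A[i][j] += eta * g[i] * f[j]  # Hebbian increment for one pair
    return A


def recall(A, f):
    """Linear associator output g = A f."""
    return [sum(a * x for a, x in zip(row, f)) for row in A]


# Toy data (hypothetical): orthonormal inputs, arbitrary outputs.
f1, g1 = [1.0, 0.0, 0.0], [1.0, -1.0]
f2, g2 = [0.0, 1.0, 0.0], [0.5, 2.0]
A = hebb_learn([(f1, g1), (f2, g2)], eta=1.0, n_out=2, n_in=3)
print(recall(A, f1))  # recovers g1 because f1 and f2 are orthonormal
print(recall(A, f2))  # recovers g2 for the same reason
```

With correlated inputs the recalled vectors would contain crosstalk from the other stored pairs, which is the motivation for the error correction procedure discussed later.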

Prototype Formation. The linear model forms prototypes as part of the storage 
process, a property we will draw on. Suppose a category contains many similar items 
associated with the same response. Consider a set of correlated vectors, $\{f_k\}$, with 
mean p, 

    $$ f_k = p + d_k. $$

The final connectivity matrix will be 

    $$ A = \eta \sum_{k=1}^{n} g f_k^T = \eta\, g \left( n\, p^T + \sum_{k=1}^{n} d_k^T \right). $$

If the sum of the $d_k$ is small, the connectivity matrix is approximated by 

    $$ A \approx \eta\, n\, g\, p^T. $$

The system behaves as if it had repeatedly learned only one pattern, p, and responds 
best to it, even though p, in fact, may never have been learned. 
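
A small numerical check of this prototype effect, as a sketch under stated assumptions: the prototype, the two distortions (chosen so that they sum exactly to zero), the response vector and the learning constant are all made-up toy values. The matrix built from the distorted examples is compared with the matrix that would result from storing the never-presented prototype n times.

```python
def store(examples, g, eta):
    """A = eta * sum_k g f_k^T for one shared response g."""
    A = [[0.0] * len(examples[0]) for _ in g]
    for f in examples:
        for i, gi in enumerate(g):
            for j, fj in enumerate(f):
                A[i][j] += eta * gi * fj
    return A


p = [1.0, 1.0, -1.0, -1.0]                  # category mean (prototype)
d = [[0.25, -0.5, 0.0, 0.25],               # distortions chosen to sum to zero
     [-0.25, 0.5, 0.0, -0.25]]
examples = [[pj + dj for pj, dj in zip(p, dk)] for dk in d]
g = [1.0, -1.0]                             # shared response
eta, n = 0.5, len(examples)

A = store(examples, g, eta)
A_proto = [[eta * n * gi * pj for pj in p] for gi in g]  # eta * n * g p^T
print(A == A_proto)      # matrix from examples vs. matrix from p alone
print(p in examples)     # whether p itself was ever presented
```

When the distortions do not cancel exactly, the two matrices differ by the term involving the sum of the distortions, which shrinks relative to the prototype term as uncorrelated examples accumulate.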

Concept forming systems. Knapp and Anderson (1984) applied this model directly 
to the formation of simple psychological 'concepts' formed of nine randomly placed 
dots. A 'concept' in cognitive science describes the common and important situation 
where a number of different objects are classed together by some rule or similarity 
relationship. Much of the power of language, for example, arises from the ability to 
see that physically different objects are really 'the same' and can be named and 
responded to in a similar fashion, for example, tables or lions. A great deal of 
experimentation and theory in cognitive science concerns itself with concept 
formation and use. 

There are two related but distinct ways of explaining simple concepts in neural 
network models. First, there are prototype forming systems, which often involve 
taking a kind of average during the act of storage, and, second, there are models 
which explain concepts as related to attractors in a dynamical system. In the radar 
ANSP system to be described we use both ideas: we want to construct a system where 
the average of a category becomes the attractor in a dynamical system, and an 
attractor and its surrounding basin represent an individual emitter. (For a further 
discussion of concept formation in simple neural networks, see Knapp and Anderson, 
1984; Anderson, 1983, and Anderson and Murphy, 1986). 

Error Correction. By using an error correcting technique, the Widrow-Hoff 
procedure, we can force the simple associative system to give us more accurate 
associations. Let us assume we are working with an autoassociative system. Suppose 
information is represented by associated vectors $f_1 \to f_1$, $f_2 \to f_2$, .... A vector, 
$f_k$, is selected at random. Then the matrix, A, is incremented according to the rule 

    $$ \Delta A = \eta\, (f_k - A f_k)\, f_k^T, $$

where $\Delta A$ is the change in the matrix A. In the radar application, there is no 
'correct answer' in the general sense of a supervised algorithm. However, every input 
pattern can be its own 'teacher' in the error correction algorithm, in that the 
network will try to better reconstruct that particular input pattern. The goal of 
learning a set of stimuli $\{f_k\}$ is to have the system behave as 

    $$ A f_k = f_k. $$

The error correcting learning rule will approximate this result with a least mean 
squares approximation, hence the alternative name for the Widrow-Hoff rule: the LMS 
(least mean squares) algorithm. The autoassociative system combined with error 
correction, when working perfectly, is forcing the system to develop a particular set 
of eigenvectors with eigenvalue 1. 
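
The autoassociative error correction rule can be sketched as follows. This is an illustration only: the two toy patterns, the learning constant of 0.1, the number of trials and the random seed are assumptions. Each randomly selected pattern acts as its own teacher, and after learning the reconstruction error for every stored pattern should be very small, that is, each pattern behaves like an eigenvector with eigenvalue close to 1.

```python
import random


def widrow_hoff_step(A, f, eta):
    """In place: A += eta * (f - A f) f^T, with the input as its own teacher."""
    out = [sum(a * x for a, x in zip(row, f)) for row in A]
    err = [fi - oi for fi, oi in zip(f, out)]
    for i, ei in enumerate(err):
        for j, fj in enumerate(f):
            A[i][j] += eta * ei * fj
    return err  # error vector measured before the update


patterns = [[1.0, 1.0, -1.0, -1.0], [1.0, -1.0, 1.0, -1.0]]  # toy stimuli
rng = random.Random(0)
A = [[0.0] * 4 for _ in range(4)]
for _ in range(200):
    widrow_hoff_step(A, rng.choice(patterns), eta=0.1)

# With eta = 0 the call only measures the error f - A f and leaves A unchanged.
worst = max(abs(e) for f in patterns for e in widrow_hoff_step(A, f, eta=0.0))
print(worst)
```

The learning constant must be small relative to the squared length of the input vectors, otherwise the updates overshoot and the procedure diverges.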

The eigenvectors of the connection matrix are also of interest when simple 
Hebbian learning is used in an autoassociative system. Then, the simple outer 
product associator has the form 

    $$ A = \eta \sum_k f_k f_k^T. $$

There is now an obvious connection between the eigenvectors of the resulting 
outer product connectivity matrix and the principal components of statistics, because 
this matrix has the form of a covariance matrix. In fact, there is growing evidence 
that many neural networks are doing something like principal component analysis 
(see, for example, Baldi and Hornik, 1989, and Cottrell, Munro and Zipser, 1988). 

BSB: A Dynamical System. We shall use for radar clustering a non-linear model 
that takes the basic linear associator, uses error correction to construct the 
connection matrix, and uses units containing a simple limiting non-linearity. 
Consider an autoassociative feedback system, where the vector output from the matrix 
is fed back into the input. Because feedback systems can become unstable, we 
incorporate a simple limiting non-linearity to prevent unit activity from getting too 
large or too small. Let f[i] be the current state vector describing the system; 
f[0] is the vector at step 0. At the (i+1)st step, the next state vector, f[i+1], is 
given by the iterative equation 

    $$ f[i+1] = \mathrm{LIMIT}\left[\, \alpha A f[i] + \gamma f[i] + \delta f[0] \,\right]. $$

We stabilize the system by bounding the element activities within limits. 

The first term, $\alpha A f[i]$, passes the current system state through the matrix and 
adds information reconstructed from the autoassociative cross connections. The 
second term, $\gamma f[i]$, causes the current state to decay slightly. This term has the 
qualitative effect of causing errors to eventually decay to zero as long as $\gamma$ is 
less than 1. The third term, $\delta f[0]$, can keep the initial information constantly 
present and has the effect of limiting the flexibility of the possible states of the 
dynamical system, since some vector elements are strongly biased by the initial input. 

Once the element values for f[i+1] are calculated, the element values are 
'limited', that is, not allowed to be greater than a positive limit or less than a 
negative limit. This is a particularly simple form of the sigmoidal nonlinearity 
assumed by most neural network models. The limiting process contains the state vector 
within a set of limits, and we have previously called this model the 'brain state in 
a box' or BSB model (Anderson, Silverstein, Ritz, and Jones, 1977; Anderson and 
Mozer, 1981). The system is in a positive feedback loop but is amplitude limited. 
After many iterations, the system state becomes stable and will not change: these 
points are attractors in the dynamical system described by the BSB equation. This 
final state will be the output of the system. In the fully connected case with a 
symmetric connection matrix, the dynamics of the BSB system can be shown to be 
minimizing an energy function. The location of the attractors is controlled by the 
learning algorithm (Hopfield, 1982; Golden, 1986). Aspects of the dynamics of this 
system are related to the 'power' method of eigenvector extraction, since repeated 
iteration will lead to activity dominated by the eigenvectors with the largest 
positive eigenvalues. The signal processing abilities of such a network occur because 
eigenvectors arising from learning uncorrelated noise will tend to have small 
eigenvalues, while signal related eigenvectors will have large eigenvalues, will be 
enhanced by feedback, and will dominate the system state after a number of iterations. 
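
The BSB iteration can be sketched in a few lines. The toy connection matrix (a single stored pattern, scaled so that it is an eigenvector with eigenvalue 1) and the noisy starting state are assumptions for illustration. The default parameter values (0.5 for the feedback term, 0.9 for the decay term, 0 for the initial-input term, limits of +2 and -2) are the ones quoted for the simulations later in the text.

```python
def bsb_step(A, state, start, alpha, gamma, delta, limit):
    """One update: LIMIT[alpha*A*f[i] + gamma*f[i] + delta*f[0]]."""
    fed_back = [sum(a * x for a, x in zip(row, state)) for row in A]
    raw = [alpha * fb + gamma * s + delta * s0
           for fb, s, s0 in zip(fed_back, state, start)]
    return [max(-limit, min(limit, r)) for r in raw]  # clip into the box


def bsb_run(A, start, alpha=0.5, gamma=0.9, delta=0.0, limit=2.0, max_iter=100):
    """Iterate until the state stops changing; return (state, iterations)."""
    state = list(start)
    for it in range(1, max_iter + 1):
        nxt = bsb_step(A, state, start, alpha, gamma, delta, limit)
        if nxt == state:
            return state, it
        state = nxt
    return state, max_iter


# Toy matrix (hypothetical): Hebbian storage of one pattern, scaled so A f = f.
f = [1.0, -1.0, 1.0, -1.0]
A = [[0.25 * fi * fj for fj in f] for fi in f]
final, steps = bsb_run(A, start=[0.5, -0.2, 0.1, -0.4])
print(final, steps)
```

In this example the component of the starting state along the stored pattern is amplified on every pass while the remainder decays, so the state is driven into the corner of the box whose signs match the stored pattern.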

We might conjecture that a category or a concept derived from many noisy 
examples would become identified with an attractor associated with a region in state 
space and that all examples of the concept would map into the point attractor. This 
is the behavior we want for radar pulse clustering. 


Neural Network Clustering Algorithms. We know there will be many radar pulses, 
but we do not know the detailed descriptions of each emitter involved. We want to 
develop the structure of the microwave environment, based on input information. A 
number of models have been proposed for this type of task, including various 
competitive learning algorithms (Rumelhart and Zipser, 1986; Carpenter and Grossberg, 
1987). 

Each pulse is different because of noise, but there are only a small number of 
emitters present relative to the number of pulses. We take the input data 
representing each pulse and form a state vector with it. A sample of several hundred 
pulses is stored in a 'pulse buffer.' We take a pulse at random and learn it, using 
the Widrow-Hoff error correcting algorithm with a small learning constant. Since 
there is no teacher, the desired output is assumed to be the input pulse data. 

Learning rules for this class of dynamical system, Hebbian learning in general 
(Hopfield, 1982) and the Widrow-Hoff rule in particular, are effective at 'digging 
holes in the energy landscape' so that the holes fall where the learned vectors are. 
That is, the final low energy attractor states of the dynamical system when BSB 
dynamics are applied will tend to lie near or on stored information. Suppose we 
learn each pulse as it comes in, using Widrow-Hoff error correction, but with a small 
learning constant. Metaphorically, we 'dig a little hole' at the location of the 
pulse. But each pulse is different. So, after a while, we have dug a hole for each 
pulse, and if the state vectors coding the pulses from a single emitter are not too 
far apart in state space, we have formed an attractor that contains all the pulses 
from a single emitter, as well as new pulses from the same emitter. Figure 2 
presents a (somewhat fanciful) picture of the behavior that we hope to obtain, where 
many nearby data points combine to give a single broad network energy minimum that 
contains them all. 


Figure 2 about here 


We can see why this behavior will occur from an informal argument. Call the 
average state vector of a particular emitter p. Then every observed pulse, 
$f_k$, will be 

    $$ f_k = p + d_k, $$

where $d_k$ is a distortion, which will be assumed to be different for every individual 
pulse, that is, different $d_k$ are uncorrelated and relatively small compared to 
p. With a small learning constant, and with the connection matrix A starting from 
zero, the magnitude of the output vector, Af, will also be small after only a few 
pulses are learned. This means that the error vector will point outward, toward $f_k$, 
that is, toward $p + d_k$, as shown in Figure 3. 


Figure 3 about here 


Early in the learning process with a small learning constant for a particular 
cluster, the error vectors (input minus output) all will point toward the cluster of 
input pulses. Widrow Hoff learning can be described as using a simple associator to 
learn the error vector. Since every $d_k$ is different and uncorrelated, the error 
vectors from different pulses will have the average direction of p. The matrix will 
act as if it is repeatedly learning p, the average of the vectors. It is easy to 
show that if the centers of different emitter clusters are spaced far apart, in 
particular, if the cluster centers are orthogonal, then p will be close to an 
eigenvector of A. In more interesting and difficult cases, where clusters are close 
together or the data is very noisy, it is necessary to resort to numerical simulation 
to see how well the network works in practice. As we hope to show, this technique 
does work quite well. 
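
The informal argument can be checked numerically for the simplest case of a single cluster. Everything in this sketch is an assumption for illustration: the emitter mean p, the Gaussian distortion with standard deviation 0.2, the learning constant of 0.01, the 300 learning trials and the random seed. After learning noisy pulses only, the output A p is compared with p in direction (cosine) and in length (gain); both values should be close to 1 if p has become close to an eigenvector of A with eigenvalue near 1.

```python
import math
import random


def matvec(A, f):
    """Matrix-vector product A f."""
    return [sum(a * x for a, x in zip(row, f)) for row in A]


rng = random.Random(1)
p = [1.0, 1.0, -1.0, -1.0, 1.0, -1.0, 1.0, -1.0]   # hypothetical emitter mean
n, eta, sigma = len(p), 0.01, 0.2
A = [[0.0] * n for _ in range(n)]

for _ in range(300):
    f = [pj + rng.gauss(0.0, sigma) for pj in p]     # pulse = p + d_k
    err = [fi - oi for fi, oi in zip(f, matvec(A, f))]
    for i in range(n):
        for j in range(n):
            A[i][j] += eta * err[i] * f[j]           # Widrow-Hoff update

Ap = matvec(A, p)
norm_Ap = math.sqrt(sum(a * a for a in Ap))
norm_p = math.sqrt(sum(b * b for b in p))
cosine = sum(a * b for a, b in zip(Ap, p)) / (norm_Ap * norm_p)
gain = norm_Ap / norm_p
print(round(cosine, 3), round(gain, 3))
```

Note that p itself is never presented during learning; only distorted versions of it are.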

After the matrix has learned so many pulses that the input and output vectors 
are of comparable magnitude, the output of the matrix when $p + d_k$ is presented will 
be near p (see Figure 4). Then, 

    $$ p \approx A p. $$

Over a number of learned examples, 

    $$ \text{total error} = \sum_k \left( p + d_k - A(p + d_k) \right) \approx \sum_k \left( d_k - A d_k \right). $$

The maximum values of the eigenvalues of A are 1 or below, the d's are uncorrelated, 
and this error term will average to zero. 


Figure 4 about here 


However, as the system learns more and more random noise, the average magnitude 
of the error vector will tend to get longer and longer, as the eigenvalues of A 
related to the noise become larger. Note that system learning never stops because 
there is always an error vector to be learned, which is a function of the intrinsic 
noise in the system. Therefore, there is a 'senility' mechanism found in this class 
of neural networks. For example, the covariance matrix of independent, identically 
distributed Gaussian noise added to each element is proportional to the identity 
matrix. Every vector is then an eigenvector with the same eigenvalue, and this 
matrix is the matrix toward which A will evolve if it continues to learn random 
noise indefinitely. When the BSB dynamics are applied to matrices resulting from 
learning very large numbers of noisy pulses, the attractor basins become fragmented, 
so that the clusters break up. However, the period of stable cluster formation is 
very long and it is easy to avoid cluster breakup in practice (Anderson, 1987). 

In BSB clustering the desired output is a particular stable state. Ideally, all 
pulses from one emitter will be attracted to that final state. Therefore a simple 
identity check is now sufficient to check for clusters. This check is performed by 
resubmitting the original noisy pulses to the network that has learned them and 
forming a list of the stable states that result. The list is then compared with 
itself to find which pulses came from the same emitter. For example, a symbol could 
be associated with the pulses from the same final state, i.e. the pulses have been 
deinterleaved or identified. 
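
The identity check on the list of stable states can be sketched as a simple grouping step. The final states below are hypothetical values, and the single-letter symbols are just one possible labelling; exact equality is used here because fully limited states take only the two limit values.

```python
def group_by_final_state(final_states):
    """Map each distinct final state to the indices of the pulses that reached it."""
    clusters = {}
    for pulse_index, state in enumerate(final_states):
        clusters.setdefault(tuple(state), []).append(pulse_index)
    return clusters


# Hypothetical final states for five pulses (limits of +2 and -2).
finals = [[2, -2, 2], [-2, 2, 2], [2, -2, 2], [2, -2, 2], [-2, 2, 2]]
clusters = group_by_final_state(finals)

# Attach an arbitrary symbol to each distinct final state.
labels = {state: chr(ord("A") + k) for k, state in enumerate(clusters)}
print([labels[tuple(s)] for s in finals])
print(len(clusters), "distinct attractors")
```

States that are not fully limited would need a tolerance or a dot product comparison instead of exact equality.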

Once the emitters have been identified, the average characteristics of the 
features describing the pulse (frequency, pulse width and pulse repetition pattern) 
can be computed. These features are used to classify the emitters with respect to 
known emitter types in order to 'understand' the microwave environment. A two stage 
system, which first clusters and then counts clusters, is easy to implement and, 
practically, allows convenient 'hooks' to use traditional digital techniques in 
conjunction with the neural networks. 


Stimulus Coding and Representation. The fundamental representation assumption of 
almost all neural networks is that information is carried by the pattern or set of 
activities of many neurons in a group of neurons. This set of activities carries the 
meaning of whatever the nervous system is doing and these sets of activities are 
represented as state vectors. The conversion of input data into a state vector, that 
is, the representation of the data in the network, is the single most important 
engineering problem faced in network design. In our opinion, choice of a good input 
and output representation is usually more important for the ultimate success of the 
system than the choice of a particular network algorithm or learning rule. 

We now suggest an explicit representation of the radar data. From the radar 
receiver, we have a number of continuous valued features to represent: frequency, 
elevation, azimuth, pulse width, and signal strength. Our approach is to code 
continuous information as locations on a topographic map, i.e. a bar graph or a 
moving meter pointer. We represent each continuous parameter value by the location of 
a block of activation on a linear set of elements. An increase in a parameter value moves 
the block of activity to the right, say, and a decrease moves the activity to the 
left. We have used a more complex topographic representation in several other 
contexts, with success (Sereno, 1989; Rossen, 1989; Viscuso, Anderson, and Spoehr, 
1989). 

We represent the bar of activity with a block of three or four '=' (equals) 
symbols placed in a region of '.' (period) symbols. Single characters are 
coded by eight bit ASCII bytes. The ASCII 1's and 0's are further transformed to 
+1's and -1's, so that the magnitude of any feature vector is the same regardless of 
the feature value. Input vectors are therefore purely binary. On recall, if the 
vector elements coding a character do not rise above a threshold size, the system is 
not 'sure' of the output. Then that character is represented as the underline 
character, '_'. Being 'not sure' can be valuable information about the confidence 
of a particular output state relative to an input. Related work has developed a more 
numeric, topographic representation for this task, called a 'closeness code' (Penz, 
1987), which has also been successfully used for clustering of simulated radar data. 

Neural networks can incorporate new information about the signal and make good 
use of it. This is one version of what is called the data fusion or sensor fusion 
problem. To code the various radar features, we simply concatenate the topographic 
vectors of the individual features into a single long state vector. Bars in different 
fields code the different quantities. Figure 5 shows these fields. 
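
The bar coding and the concatenation of fields can be sketched as follows. The field width of 12 characters, the bar width of 3, the feature ranges and the pulse values are all assumptions made for illustration; they were chosen only so that five fields give 60 characters and therefore 480 vector elements, the size quoted for the simulations, and they should not be read as the layout actually used.

```python
def bar_field(value, lo, hi, width=12, bar=3):
    """Field of '.' with a bar of '=' whose position tracks value in [lo, hi]."""
    frac = min(max((value - lo) / (hi - lo), 0.0), 1.0)
    start = int(frac * (width - bar) + 0.5)          # leftmost bar position
    return "." * start + "=" * bar + "." * (width - bar - start)


def to_state_vector(text):
    """Eight-bit ASCII per character, with bits 1/0 mapped to +1/-1."""
    vec = []
    for ch in text:
        vec.extend(1 if bit == "1" else -1 for bit in format(ord(ch), "08b"))
    return vec


# Hypothetical pulse and feature ranges: (value, low, high) per field.
pulse = {
    "frequency": (9.4, 8.0, 12.0),
    "elevation": (10.0, 0.0, 90.0),
    "azimuth": (270.0, 0.0, 360.0),
    "pulse_width": (1.0, 0.0, 5.0),
    "signal_strength": (0.7, 0.0, 1.0),
}
text = "".join(bar_field(*spec) for spec in pulse.values())
state = to_state_vector(text)
print(text)
print(len(text), len(state))
```

Because every character contributes eight elements of magnitude 1, the length of the state vector does not depend on where the bars sit, which is the property the coding is meant to provide.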


Figure 5 about here 


Below we will gradually add information to the same network to show the utility 
of this fusion methodology. The conjecture is that adding more information about 
the pulse will produce more accurate clustering. Note that we can insert 'symbolic' 
information (say word identifications or other appropriate information) in the state 
vector as character strings, forming a hybrid code. For instance the state vector 
can contain almost unprocessed spectral data together with the symbolic bar graph 
data combined with character strings representing symbols at the same time. 

A Demonstration. For the simulations of the radar problem that we describe 
next, we used a BSB system with the following properties. The system used 480 units, 
representing 60 characters. Connectivity was 25%, that is, each element was 
connected at random to 120 others. There were a total of 10 simulated emitters with 
considerable added intrinsic noise. A pulse buffer of 510 different pulses was used 
for learning and, after learning, 100 new pulses, 10 from each emitter, were used to 
test the system. There were about 2000 total learning trials, that is, about 
four presentations per example. Parameter values were $\alpha = 0.5$, $\gamma = 0.9$ and $\delta = 0$. 
The limits for thresholding were +2 and -2. None of these parameters were critical, 
in that moderate variations of the parameters had little effect on the resulting 
classifications of the network. 

Suppose we simply learn frequency information. Figure 6 shows the total number 
of attractors formed when ten new examples of each of ten emitters were passed 
through the BSB dynamics, using the matrix formed from learning the pulses in the 
pulse buffer. In a system that clustered perfectly, exactly 10 final states would 
exist, one different final state for each of the ten emitters. However, with only 
frequency information learned, all the 100 different inputs mapped into only two 
attractors. 


Figure 6 about here 


Figure 6 and others like it below are graphical indications of the similarity 
between recalled clusters or states with computational energy minima. The states 
shown in the figures are ordered via a priori knowledge of the emitters, although 
this information was obviously not given to the network. One can visually inspect 
the outputs for equality of two emitters (lumping of different emitters) or 
separation of outputs for a single emitter (splitting of the same emitter). 
This display method is for the reader's benefit. The ANSP system 
determines the number and state vector of separate minima by a dot product search of 
the entire output list, as discussed above. The position of the bar of '=' symbols codes 
the frequency in the frequency field, which is the only field learned in this example. 

Let us now give the system additional information about pulse azimuth and 
elevation. Clustering performance improves markedly, as shown in Figure 7. We get 
nine different attractors. There is still uncertainty in the system, however, since 
few corners are fully saturated, as indicated by the underline symbols on the corners 
of some bars. States 1 and 3 are in the same attractor, an example of incorrect 
'lumping' as a result of insufficient information. Two other final states (8 and 9) 
are very close to each other in Hamming distance. 
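
For reference, the Hamming distance between two final states is simply the number of elements at which they differ. The two state vectors in this sketch are hypothetical, not the actual states 8 and 9 from the simulation.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length state vectors differ."""
    if len(a) != len(b):
        raise ValueError("state vectors must have the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


state_8 = [2, 2, -2, -2, 2, -2, 2, -2]   # hypothetical saturated final states
state_9 = [2, 2, -2, -2, 2, -2, -2, -2]
print(hamming_distance(state_8, state_9))  # -> 1
```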


Figure 7 about here 


Let us assume that future advances in receivers will allow a quick estimation of 
the microstructure of each radar pulse. We have used, as shown in Figure 8, a coding 
which is a crude graphical version of a Fourier analysis of an individual pulse, with 
the center frequency located at the middle of the field. Emitter pulse spectra were 
assigned arbitrarily. 


Figure 8 about here 


Note that the spectral information can be included in the state vector in only 
slightly processed form: we have included almost a caricature of the actual 
spectrum. 


Addition of spectral information improved performance somewhat. There were nine 
distinct attractors, though still many unsaturated states. Two emitters were still 
'lumped', 8 and 9. Figure 9 shows the results. 


Figure 9 about here 


Suppose we add information about pulse width to azimuth, elevation, and 
frequency. The simulated pulse width information is very poor. It actually degrades 
performance, though it does allow separation of a couple of nearby emitters. The 
results are given in Figure 10. 


Figure 10 about here 


The pulse width data is of poor quality and hurts discrimination 
because of a common artifact due to the way that pulse width is measured. When two 
pulses occur close together in time, a very long pulse width is measured by the 
receiver circuitry. This can give rise in unfavorable cases to a spurious bimodal 
distribution of pulse widths for a single emitter. Therefore, a single emitter seems 
to have some short pulse widths and some very long pulse widths, and this can split 
the category. Bimodal distributions of an emitter parameter, when the peaks are 
widely separated, are a hard problem for any clustering algorithm. A couple of 
difficult discriminations in this simulation, however, are aided by the additional 
data. 

We now combine all this information about pulse properties together. None of 
the subsets of information could perfectly cluster the emitters. Pulse width, in 
particular, actually hurt performance. Figure 11 shows that, after learning, using 
all the information, we now get ten well separated attractors, i.e. the correct 
number of emitters relative to the data set. The conclusion is that the additional 
information, even if it was noisy, could be used effectively. Poor information could 
be combined with other poor information to give good results. 


Figure 11 about here 


Processing After Deinterleaving. Having used the ANSP system to deinterleave 
and cluster data, we also have a way of producing an accurate picture of each 
emitter. We now have an estimate of the frequency and pulse width and can derive 
other emitter properties (Penz et al., 1989), for example, the emitter pulse 
repetition pattern. One method to learn this pattern is to learn pulse repetition 
interval (PRI) pairs autoassociatively. Another is to autocorrelate the PRI's of a 
string. This technique probably provides more information than any other for 
characterizing emitters, because the resulting correlation functions are very useful 
for characterizing a particular emitter type. 
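
One way to autocorrelate the PRIs of a pulse string is sketched below. The normalization (mean removed, divided by the zero-lag energy) and the toy arrival times for a two-level staggered emitter are assumptions; the text does not specify how the correlation was computed.

```python
def pri_autocorrelation(arrival_times, max_lag):
    """Normalized autocorrelation of the mean-removed PRI sequence."""
    pris = [t1 - t0 for t0, t1 in zip(arrival_times, arrival_times[1:])]
    mean = sum(pris) / len(pris)
    x = [v - mean for v in pris]
    energy = sum(v * v for v in x)
    if energy == 0.0:          # perfectly constant PRI: no variation to correlate
        return [0.0] * (max_lag + 1)
    return [sum(a * b for a, b in zip(x, x[lag:])) / energy
            for lag in range(max_lag + 1)]


# Hypothetical two-level staggered PRI: intervals alternate 1.0, 2.0 (ms).
times = [0.0]
for k in range(8):
    times.append(times[-1] + (1.0 if k % 2 == 0 else 2.0))
acf = pri_autocorrelation(times, max_lag=3)
print(acf)
```

For this alternating pattern the correlation is negative at odd lags and positive at even lags, so the period of the stagger shows up directly in the correlation function.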


Classification Problem and Neural Network Data Bases. The next task is to 
classify the observed emitters based on our previous experience with emitters of 
various types. We continue with the neural network approach because of the ability 
of networks to incorporate a great deal of information from different sensors, their 
ability to generalize (i.e. 'guess') based on noisy or incomplete information, and 
their ability to handle ambiguity. Known disadvantages of neural networks used as 
data bases are their slow computation using traditional computer architectures, 
erroneous generalizations (i.e. 'bad guesses'), their unpredictability, and the 
difficulty of adding new information to them, which may require time consuming 
relearning. 

Information, in traditional expert systems, is often represented as collections 
of atomic facts, relating pairs or small sets of items together. Expert systems 
often assume 'IF (x) THEN (y)' kinds of information representation. For example, 
such a rule in radar might look like: 

IF (Frequency is 10 GHz) 

AND (Pulse Width is 1 microsecond) 

AND (PRI is constant at 1 kHz) 

THEN (Emitter is a Klingon air traffic control radar). 


Problems with this approach are that rules usually have many exceptions, data 
may be erroneous or noisy, and emitter parameters may be changed because of local 
conditions. Expert systems may be exceptionally prone to confusion when emitter 
properties change because of the rigidity of their data representation. Neural 
networks allow a different strategy: Always try to use as much information as you 
have, because, in most cases, the more information you have, the better performance 
will be. 

As William James commented in the nineteenth century, 

. . . the more other facts a fact is associated with in the mind, 
the better possession of it our memory retains. Each of its 
associates becomes a hook to which it hangs, a means to fish it 
up by when sunk beneath the surface. Together, they form a 
network of attachments by which it is woven into the entire 
tissue of our thought. 


William James (1890, p. 301) 


Perhaps, as William James suggests, information is best represented as large 
sets of correlated information. We could represent this in a neural network by a 
large, multimodal state vector. Each state vector contains a large number of 'atomic 
facts' together with their cross correlations. Our clustering demonstration showed 
that more information could be added and used efficiently and that identification 
depends on a cluster of information co-occurring. (See Anderson, 1986 for further 
discussion of neural network data bases of this type.) 

Ultimately, we would like a system that would tentatively identify emitters 
based on measured properties and previously known information. Since we know, in 
operation, that parameters can and often do change, we can never be sure of the 
answers. 

As a specific important example, radar systems can shift parameters in ways 
consistent with their physical design, that is, waveguide sizes, power supply size, 
and so on, for a number of reasons, for example, weather conditions. If an emitter 
is characterized by only one parameter, and that parameter is changed, then 
identification becomes very unlikely. Therefore, accuracy of measurement of a 
particular parameter may not be as useful for classification as one might expect. 
However, using a whole set of co-occurring properties, each at low precision, may 
prove a much more efficient strategy for identification. For further discussion of 
how humans often seem to use such a strategy in perception, consult George Miller's 
classic 1956 paper, "The magical number seven, plus or minus two." 


Classification Problem for Shifted Emitters. Our first neural net 
classification simulation is specifically designed to study sensitivity to shifts in 
parameters. Two data sets were generated. One set had 'normal' emitter properties 
and the other set had all the emitter properties changed by about 10 percent. The two 
sets each contained about 500 data points. The names used are totally arbitrary. 
The state vector was constructed of a name string (the first 10 characters) and bar 
codes for frequency, pulse width, and pulse repetition interval. For the 
classification function, the position of "+" symbols indicates the feature magnitude 
while the blank symbol fills the rest of the feature field. Again the underline symbol 
indicates an undecided node. 

Figures 12 and 13 show the resulting attractor interpretations. Figure 12 shows 
the vectors to be learned autoassociatively by the BSB model. The first field is the 
emitter name. The last three fields represent the numerical information produced by 
the deinterleaver and pulse repetition interval modules. An input consists of 
leaving the identification blank and filling in the analog information for the 
emitter for which one wants an identification. The autoassociative connections fill in 
the missing identification information. 
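
Filling in a blank name field by autoassociation can be sketched on a very small scale. This sketch makes several simplifying assumptions: the stored vectors are short abstract patterns of +1 and -1 rather than ASCII-coded names and bar codes, the two stored vectors are orthogonal, the emitter names are placeholders, and the matrix is built with the simple Hebbian outer product rather than Widrow-Hoff error correction.

```python
def bsb_run(A, start, alpha=0.5, gamma=0.9, limit=2.0, max_iter=100):
    """Iterate LIMIT[alpha*A*f + gamma*f] (delta = 0) until the state is stable."""
    state = list(start)
    for _ in range(max_iter):
        fed_back = [sum(a * x for a, x in zip(row, state)) for row in A]
        nxt = [max(-limit, min(limit, alpha * fb + gamma * s))
               for fb, s in zip(fed_back, state)]
        if nxt == state:
            break
        state = nxt
    return state


# Hypothetical stored vectors: 4 "name" elements followed by 4 "parameter" elements.
names = {(1, 1, -1, -1): "emitter A", (-1, 1, 1, -1): "emitter B"}
params = {"emitter A": [1, -1, 1, -1], "emitter B": [1, 1, -1, -1]}
stored = [list(code) + params[name] for code, name in names.items()]

# Hebbian autoassociative matrix, scaled so each stored vector has eigenvalue 1.
n = len(stored[0])
A = [[sum(v[i] * v[j] for v in stored) / n for j in range(n)] for i in range(n)]


def identify(measured_params):
    """Leave the name field blank (zeros), run BSB, and read the name back."""
    final = bsb_run(A, [0.0] * 4 + list(measured_params))
    code = tuple(1 if x > 0 else -1 for x in final[:4])
    return names.get(code, "unknown")


print(identify([1, -1, 1, -1]), "|", identify([1, 1, -1, -1]))
```

The blank name elements start at zero and are driven toward the stored name pattern by the cross connections from the parameter elements, which is the completion behavior described above.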

Figure 12 shows the identifications produced when the normal set is provided to 
the matrix: all the names are produced correctly and in a small number of iterations 
through the BSB algorithm. Figure 13 uses the same matrix, but the input data is now 
derived from sources whose mean values are shifted about 10 percent, to emulate this 
parameter shift. 


Figure 12 about here 


Figure 13 about here 


There were three errors of classification. Emitter 3 was classified as 'Airborn In' 
instead of 'AA FC'. Emitter 4 was classified as 'SAM Target' instead of 'Airborn 
In'. Emitter 7 was classified as 'Airborn In' rather than the correct 'SAM Target' 
name. Note that the recalled analog information is also not exactly the correct 
analog information even for the correctly identified emitters. At a finer scale, the 
number of iterations required to reach an attractor state was very long. This is a 
direct measure of the uncertainty of the neural network about the shifted data. Some 
of the final states were not fully limited, another indication of uncertainty. 


Large Classification Data Bases. It would be of interest to see how the system 
worked with a larger data base. Some information about radar systems is published in 
Jane's Weapon Systems (Blake, 1988). We can use this data as a starting point to see 
how a neural network might scale to larger systems. Figure 14 shows the kind of data 
available from Jane's. Some radars have constant pulse repetition frequency (PRF) 
and others have highly variable PRF's. (Jane's lists Pulse Repetition Frequency 
(PRF) in its tables instead of Pulse Repetition Interval (PRI). We have used their 
term for their data in this simulation.) We represented PRF variability in the state 
vector coding by increasing the last bar width (Field 7, Figure 15) for highly 
variable PRF's (see the Swedish radar, for an example.) Also, when a parameter is out 
of range (the average PRF of the Swedish radar) it is not represented. 


Figure 14 about here 


Figure 15 about here 


We perform the usual partitioning of the state vector into fields, as shown in 
Figure 15. For this simulation, the frequency scale is so coarse that even enormous 
changes in frequency would not change the bar coding significantly. We are more 
interested here in whether the system can handle large amounts of Jane's data. We 
taught the network 47 different kinds of radar transmitters. Some transmitter names 
were represented by more than one state vector because they can have several, quite 
different modes of operation, that is, the parameter part of the code can differ 
significantly from mode to mode. (The clustering algorithms would almost surely pick 
up different modes as different clusters.) After learning, we provided the measured 
properties to the network to see if it could regenerate the name of the 
country that the radar belonged to. There were only three errors of retrieval from 
47 sets of input data, corresponding to 94 percent accurate country identification. 
This experiment was basically coding a lookup table, using low precision 
representations of the parameters. Figure 16 shows a sample of the output, with 
reconstructions of the country, designations, and functions. 


Figure 16 about here 


Conclusions. We have presented a system using neural networks which is capable 
of clustering and identifying radar emitters, given as input data large numbers of 
received radar pulses and with some knowledge of previously characterized emitter 
types. 

Good features of this system are its robustness, its ability to integrate 
information from co-occurrence of many features, and its ability to integrate 
information from individual data samples. 

We might point out that the radar problem is similar to data analysis problems 
in other areas. For example, it is very similar to a problem in experimental 
neurophysiology, where action potentials from multiple neurons are recorded with a 
single electrode. Applications of the neural network techniques described here may 
not be limited to radar signal processing. 


Figures, Anderson, Gately, Penz and Collins 
Caption, Figure 1 

Block diagram of the radar clustering and categorizing system. 


Caption, Figure 2 

Landscape surface of system energy. Several learned examples may 
contribute to the formation of a single energy minimum which will 
correspond to a single emitter. This drawing is only for illustrative 
purposes and is not meant to represent the very high dimensional 
simulations actually used. 


Caption, Figure 3 

The Widrow-Hoff procedure learns the error vector. The error 
vectors early in learning with a small learning constant point toward 
examples, and the average of the error vectors will point toward the 
category mean, i.e. all the examples of a single emitter. 


Caption, Figure 4 

Assume an eigenvector is close to a category mean, as will be the 
result after extensive error correcting, autoassociative learning. 
The error terms from many learned examples, with a small learning 
constant, will average to zero and the system attractor structure will 
not change markedly. (There are very long term 'senility' mechanisms 
with continued learning, but they are not of practical importance for 
this application.) 

Figure 4 labels: Late in Learning Process (Small Learning Constant, Many Examples); Input states; Cluster prototype 



Figure 5 


Radar Pulse Fields: Coding of Input Information 
Position of the bar codes an analog quantity 


Azimuth   Elevation   Frequency   Pulse Width   Pseudo-spectra 
|<----->|<------->|<------->|<--------->|<------------>| 


In any field: A move to the left decreases the quantity 
              A move to the right increases the quantity 

Caption, Figure 5 

Input representation of analog input data uses bar codes. The 
state vector is partitioned into fields, corresponding to azimuth, 
elevation, frequency, pulse width, and a field corresponding to 
additional information that might become available with advances in 
receiver technology. 


Figure 6 


Clustering by Frequency Information Only 

Columns: Emitter Number; Final Output State 
Fields of the output state: Azimuth, Elevation, Frequency, Pulse Width, Pseudo-spectra 
Rows: emitters 1 to 10 (bar-coded output states not reproduced) 


Caption, Figure 6 

Final attractor states when only frequency information is 
learned. Ten different emitters are present, but only two different 
output states are found. 


Figure 7 


Clustering Using Azimuth, Elevation and Frequency Information 

Columns: Emitter Number; Final Output State 
Fields of the output state: Azimuth, Elevation, Frequency, Pulse Width, Pseudo-spectra 


Caption, Figure 7 

When azimuth, elevation and frequency are provided for each data 
point, performance is better. However, two emitters are lumped 
together, and three others have very close final states. 


Figure 8 


a) Monochromatic pulse. 

b) Subpulses with distinct frequencies. 
   (Or some kinds of FM or phase modulation) 

c) Continuous frequency sweep during the pulse 
   (i.e. pulse compression) 


Caption, Figure 8 

Suppose we can assume that advances in receiver technology will 
allow us to incorporate a crude 'cartoon' of the spectrum of an 
individual pulse into the coding of the state vector representing an 
example. The spectral information can be included in the state vector 
in only slightly processed form. 


Figure 9 


Spectrum, Azimuth, Elevation, Frequency 

Columns: Emitter Number; Final Output State 
Fields of the output state: Azimuth, Elevation, Frequency, Pulse Width, Pseudo-spectra 




Caption, Figure 9 

Including pseudo-spectral information helped performance 
considerably. Only two emitters are lumped and the other emitters are 
well separated. 


Figure 10 


Pulse Width, Azimuth, Elevation and Frequency 

Columns: Emitter Number; Final Output State 
Fields of the output state: Azimuth, Elevation, Frequency, Pulse Width, Pseudo-spectra 
Rows: emitters 1 to 10 (bar-coded output states not reproduced) 


Caption, Figure 10 

Suppose we add pulse width information to our other information. 
Pulse width data is of poor quality because when two pulses occur 
close together, a very long pulse width is measured by the receiver 
circuitry. This gives rise to a bimodal distribution of pulse widths, 
and the system splits one category. 


Figure 11 


Clustering With All Information 

Columns: Emitter Number; Final Output State 
Fields of the output state: Azimuth, Elevation, Frequency, Pulse Width, Pseudo-spectra 


Caption, Figure 11 

When all available information is used, ten stable, well 
separated attractors are formed. This shows that such a network 
computation can make good use of additional information. 


Figure 12 


Learn Normal Set, Test Normal Set 

Columns: Name; Frequency; PW; PRI 
Rows: emitters 1 to 10, each with a name label (for example "SAM Target" or "Airborn In") and its bar-coded parameters (not reproduced) 


Caption, Figure 12 

We can attach identification labels to emitters along with 
representations of their analog parameters. The names and values used 
here are random and were chosen arbitrarily. 


Figure 13 


Learn Normal Set, Test Set with Shifted Parameters 

Columns: Name; Frequency; PW; PRI 
Retrieved names, in the order listed: SAM Target, Airborn In, Airborn In, SAM Target, Airborn In, Airborn In, Airborn In, SAM Target, SAM Target, SAM Target 
Three of the rows are marked "x error"; the bar-coded parameter fields are not reproduced. 


Caption, Figure 13 

Even if the emitter parameters shift slightly, it is still 
possible to make some tentative emitter identifications. Three errors 
of identification were made. Neural networks are able to generalize 
to some degree, if the representations are chosen properly. The names 
and values used here are random and were chosen arbitrarily. 


Figure 14 


Sample Data Obtained from Jane's Weapon Systems 

Three Radars from Jane's: 

China, JY-9, Search 
Frequency    : 2.0 - 3.0 GHz 
Pulse Width  : 20 microseconds 
PRF          : 0.850 kHz 
PRF Variance : Constant frequency 

Sweden, UAR1021, Surveillance 
Frequency    : 8.6 - 9.5 GHz 
Pulse Width  : 1.5 microseconds 
PRF          : 4.8 - 8.1 kHz 
PRF Variance : 3 frequency staggered 

USA, APQ113, Fire Control 
Frequency    : 16 - 16.4 GHz 
Pulse Width  : 1.1 microseconds 
PRF          : 0.674 kHz 
PRF Variance : None (Constant frequency) 


Caption, Figure 14 

Sample data on radar transmitters taken from Jane's Weapon 
Systems (Blake, 1988). 


Figure 15 


Coding into Partitioned State Vector: 

Symbolic Fields: 
Field 1   Country 
Field 2   Designation 
Field 3   Purpose 

Continuous Fields: 
Field 4   Frequency 
Field 5   Pulse Width 
Field 6   PRF 
Field 7   PRF Variation 

Example state vectors (symbolic fields 1 to 3; bar-coded fields 4 to 7 not reproduced): 
ChinaRY-9 Searc 
SwedeUAR10Surve 
USA..APQ11FireC 

Analog Bar Code Ranges: 
Frequency:     0 to 14 GHz 
Pulse Width:   0 to 10 microseconds 
PRF:           0 to 4 kHz 
PRF Variance:  0 to 200% of average PRF 


Caption, Figure 15 

Bar code representation of Jane's data. Note the presence of 
both symbolic information such as country name and transmitter 
designation, and analog, bar coded information such as frequency, 
pulse width, etc. 


Figure 16 


Data Retrieval: Data from Jane's Weapon Systems 
(only part of the data) 

Final output states: 3 errors in reconstructed country (rows marked X in the figure) 
Fields: 1 2 3 4 5 6 7 (symbolic fields 1 to 3 shown; bar-coded fields not reproduced) 

ChinaRY-9 Searc 
USA..FPS24Searc 
China571..Surve 
China581..Warni 
China311-AFireC 
FrancTRS20Surve 
IndiaPSM-3Searc 
EnglaAS3-_FireC 
EnglaMARECMarin 
USA..FPS24Searc 
USA..PAR OAppro 
IsraeELM?2Marin 
USA.._PR20Appro 
USA..TPS43FireC 
USA..APQ11FireC 
USA..APS12Surve 
IsraeELM22Marin 
IsraeELM20FireC 
SwedeGirafSearc 
SwedeUAR10Surve 
USSR.BarloSearc 
IsraeELM20FireC 
USSR.FireCFireC 
USSR.HenSeWarni 
USSR.KnifeWarni 
USSR.JayBiAirbo 


Caption, Figure 16 

When only analog data is provided at the input, the network will 
fill in the most appropriate country name. In this trial simulation, 
a network learned 47 different transmitters and was able to correctly 
retrieve the associated country in 43 of them. 


ABSTRACT 

Kohonen's "feature maps" approach to clustering is often likened to the k or c-means clustering algorithms. In 
this note we identify some similarities and differences between the hard and fuzzy c-Means (HCM/FCM) or 
ISODATA algorithms and Kohonen's "self-organizing" (KSO) approach. We conclude that some differences 
are significant, but at the same time there may be some important unknown relationship(s) between the two 
methodologies. We propose several avenues of research which, if successfully resolved, would strengthen 
both the HCM/FCM and Kohonen clustering models. We do not, in this note, address aspects of the KSO 
method related to associative memory and to the feature map display technique. 

1. INTRODUCTION 

Treatments of many classical approaches to clustering appear in Kohonen [1], Bezdek [2], and Duda and Hart 
[3]. Kohonen's work has become particularly timely in recent years because of the widespread resurgence of 
interest in Artificial Neural Network (ANN) structures. ANNs and pattern recognition are discussed by Pao [4] 
and Lippman [5]. Our interest lies with the KSO algorithm as it relates to the solution of clustering and 
classification problems and the HCM/FCM models. 

2. CLUSTERING ALGORITHMS AND CLASSIFIER DESIGN 

Let (c) be an integer, $1 < c < n$, and let $X = \{x_1, x_2, \ldots, x_n\}$ denote a set of (n) feature vectors in $\mathbb{R}^s$. X is 
numerical object data; the j-th object (some physical entity such as a medical patient, seismic record, etc.) has 
vector $x_j$ as its numerical representation; $x_{jk}$ is the k-th characteristic (or feature) associated with object j. Given 
X, we say that (c) fuzzy subsets $\{u_i : X \to [0,1]\}$ are a fuzzy c-partition of X in case the (cn) values 
$\{u_{ik} = u_i(x_k),\ 1 \le k \le n,\ 1 \le i \le c\}$ satisfy three conditions: 

$0 \le u_{ik} \le 1$ for all i, k    (1a) 

$\sum_{i=1}^{c} u_{ik} = 1$ for all k    (1b) 

$0 < \sum_{k=1}^{n} u_{ik} < n$ for all i.    (1c) 

Each set of (cn) values satisfying conditions (1) can be arrayed as a (c x n) matrix $U = [u_{ik}]$. The set of all such 
matrices are the non-degenerate fuzzy c-partitions of X: 

$M_{fcn} = \{U \in \mathbb{R}^{cn} \mid u_{ik} \text{ satisfies (1) for all } i \text{ and } k\}$.    (2) 

And in case all the $u_{ik}$'s are either 0 or 1, we have the subset of hard (or crisp) c-partitions of X: 

$M_{cn} = \{U \in M_{fcn} \mid u_{ik} = 0 \text{ or } 1 \text{ for all } i \text{ and } k\}$.    (3) 

The reason these matrices are called partitions follows from the interpretation of $u_{ik}$ as the membership of $x_k$ in 
the i-th partitioning subset (cluster) of X. $M_{fcn}$ is more realistic as a physical model than $M_{cn}$, for it is common 
experience that the boundaries between many classes of real objects are in fact very badly delineated (i.e., 
really fuzzy). The important point is that all clustering algorithms generate solutions to the clustering problem 
for X which are matrices in $M_{fcn}$. The clustering problem for X is, quite simply, the identification of an "optimal" 
partition U of X in $M_{fcn}$; that is, one that groups together object data vectors (and hence the objects they 
represent) which share some well defined (mathematical) similarity. It is our hope and implicit belief, of course, 
that an optimal mathematical grouping is in some sense an accurate portrayal of natural groupings in the 
physical process from whence the object data are derived. The number of clusters (c) must be known, or 
becomes an integral part of the problem. 
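
Conditions (1a) to (1c) and the hard special case (3) are easy to check mechanically. The sketch below is an illustration only: the function names and the representation of U as a list of c rows of n memberships are assumptions made here, and a small tolerance is used for the column sums because of floating-point rounding.

```python
def is_fuzzy_c_partition(U, tol=1e-9):
    """Check conditions (1a)-(1c) for a c x n membership matrix U (list of rows)."""
    c, n = len(U), len(U[0])
    # (1a): every membership lies in [0, 1]
    in_range = all(0.0 <= u <= 1.0 for row in U for u in row)
    # (1b): the memberships of each object x_k sum to 1 over the c clusters
    columns_ok = all(abs(sum(U[i][k] for i in range(c)) - 1.0) <= tol
                     for k in range(n))
    # (1c): no cluster is empty and no cluster holds everything
    rows_ok = all(0.0 < sum(row) < n for row in U)
    return in_range and columns_ok and rows_ok


def is_hard_c_partition(U, tol=1e-9):
    """Membership in M_cn: a fuzzy c-partition whose entries are all 0 or 1."""
    return is_fuzzy_c_partition(U, tol) and all(u in (0, 1) for row in U for u in row)


U_fuzzy = [[0.9, 0.5, 0.2],
           [0.1, 0.5, 0.8]]
U_hard = [[1, 1, 0],
          [0, 0, 1]]
U_degenerate = [[1, 1, 1],
                [0, 0, 0]]   # violates (1c): the second cluster is empty
```

Here `U_fuzzy` lies in $M_{fcn}$ but not in $M_{cn}$, `U_hard` lies in both, and `U_degenerate` lies in neither.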

3. THE ISODATA AND KSO ALGORITHMS 

The most well known objective function for clustering is the least total squared error function: 

$J_1(U, v; X) = \sum_{k=1}^{n} \sum_{i=1}^{c} u_{ik} (\|x_k - v_i\|_I)^2$,    (4) 

where $v = (v_1, v_2, \ldots, v_c)$ is a vector of (unknown) cluster centers (weights or prototypes), $v_i \in \mathbb{R}^s$ for $1 \le i \le c$, 
$U \in M_{cn}$ is an unknown hard c-partition of X, and $\|\cdot\|_I$ is the Euclidean norm on $\mathbb{R}^s$. Optimal partitions U* of X 
are taken from pairs (U*, v*) that are "local minimizers" of $J_1$. It is important to recognize the geometric impact 
that the use of a norm function in $J_1$ as the criterion of (dis)similarity has on "good clusters" (here $\|\cdot\|_I$, but 
more generally, any norm on $\mathbb{R}^s$ induced by a positive definite weight matrix A, as described below). Figure 1 
illustrates this graphically; partitions that optimize $J_1$ will, generally speaking, contain clusters that conform to 
the topology that is induced on $\mathbb{R}^s$ by the eigenstructure of the norm-inducing matrix A. When A = I, good 
clusters will be hyperspherical, as the one in the left portion of Figure 1; otherwise, they will be hyperelliptical, 
as the one on the right side of Figure 1. 

Figure 1. Geometry of Cluster Formation In Norm-Driven Clustering Algorithms 



As is evident in Figure 1, clusters that optimize $J_1$ are formed on the basis of two properties: location and 
shape. Location information is contained in the lengths of the data vectors and "cluster centers" or prototypes 
$\{v_i\}$ from the origin, whilst shape information is embedded in the topology induced by the norm in use. 
Roughly speaking, these correspond to the mean and variance of probability distributions, so (4) is in some 
sense analogous to regarding the data as being drawn from a mixture of probability density functions (indeed, 
there are special cases when (4) yields identical results to the maximum likelihood estimators of the parameters 
of a mixture of normal distributions). Although the norm shown in (4) is the Euclidean norm, generalizations of 
$J_1$ have used all five of the usual norms encountered in numerical analysis and pattern recognition, viz., the 
Euclidean, Diagonal and Mahalanobis inner product A-norms; and the p = 1 and p = $\infty$ (city block and sup) 
Minkowski norms. The defining equations and unit ball shapes for these two families of norms are shown in 
Figure 2. 

As an explicit means for finding optimal partitions of object data, $J_1$ was popularized as part of the ISODATA 
("Iterative Self-Organizing Data Analysis") algorithm (c-Means + Heuristics) by Ball and Hall [6] in 1967. It is 
interesting to note that Kohonen apparently first used the term "self-organizing" to describe his approach 
about 15 years later [1]. Apparently, the feature of both algorithms that suggests this phrase is their ability to 
iteratively adjust the weight vectors or prototypes that subsequently represent the data in an orderly and 
improving manner as the algorithms proceed with iteration. We contend that this use of the term 
"self-organizing" in the current context of neural network research is somewhat misleading (in both cases). Indeed, 
if the aspect of FCM/HCM and KSO that entitles us to call them self-organizing is their ability to adjust their 
parameters during "training", then every iterative method that produces approximations from data is 
self-organizing (e.g., Newton's method!). On the other hand, if this term serves to indicate that the algorithms in 
question can find meaningful labels for objects without external interference (labelled training examples), 
then all clustering algorithms are "self-organizing". Since the terminology in both cases is well established, the 
only expectation this writer has about the efficacy of these remarks is that they caution readers to take the 
semantics associated with much of the current Neural Network literature with a large grain of salt. 


Figure 2. Geometry of Level Sets for Inner product A-norms and Minkowski p-norms 


Unit Ball Shapes in the A-norms 
$L_A = \{x : \langle x, x \rangle_A = x^T A x = (\|x\|_A)^2 = 1\}$ 

$\|x_k - v_i\|_A = ((x_k - v_i)^T A (x_k - v_i))^{1/2}$ 
EV's of pos-definite (A) induce shapes 
Inner product: Hilbert space structure 
Differentiable in all variables 


Unit Ball Shapes in the p-norms 
$L_p = \{x : \|x\|_p = 1\}$ 

$\|x_k - v_i\|_p = (\sum_j |x_{kj} - v_{ij}|^p)^{1/p}$ 
$\|x_k - v_i\|_1 = \sum_j |x_{kj} - v_{ij}|$ 
$\|x_k - v_i\|_\infty = \max_j \{|x_{kj} - v_{ij}|\}$ 
p = 2: Hilbert; p ≠ 2: Banach spaces 


Dunn [7] first generalized $J_1$ by allowing U to be fuzzy (m = 2 below) and the norm to be an arbitrary inner 
product A-norm. Bezdek [8] generalized Dunn's functional to the fuzzy ISODATA family written as: 

$J_m(U, v; X) = \sum_{k=1}^{n} \sum_{i=1}^{c} (u_{ik})^m (\|x_k - v_i\|_A)^2$,    (5) 

where $m \in [1, \infty)$ is a weighting exponent on each fuzzy membership; $U \in M_{fcn}$ is a fuzzy c-partition of X; 
$v = (v_1, v_2, \ldots, v_c)$ are cluster centers in $\mathbb{R}^s$; A is any positive definite (s x s) matrix; and 
$(\|x_k - v_i\|_A)^2 = (x_k - v_i)^T A (x_k - v_i)$ is the squared distance (in the A norm) from $x_k$ to $v_i$. 

In 1979, Gustafson and Kessel [8] derived necessary conditions to minimize an extension of (5) with (c) 
different norm inducing matrices. In 1981 Bezdek et al. [9] generalized (5) by allowing the prototypes to be 
(convex combinations of) linear manifolds of arbitrary and different dimensions. In 1985 Pedrycz [10] 
introduced a way to use partially labeled data with (5) that amounts to a mixed supervised-unsupervised 
clustering scheme. In 1989 Dave [11] introduced a generalization of (5) that uses hyperspherical prototypes 
for v. In 1990 Bobrowski and Bezdek [12] used the city block and sup norms with (5), thus extending the 
c-Means algorithms to the most important Minkowski norms (p = 1 and p = $\infty$). 

Necessary conditions that define iterative algorithms for (approximately) minimizing $J_m$ and its generalizations 
are known. Our interest lies with the cases represented by (4) and (5). The conditions that are necessary for 
minima of $J_1$ and $J_m$ follow: 

Hard c-Means (HCM) Theorem [2]. (U, v) may minimize $\sum_k \sum_i u_{ik} (\|x_k - v_i\|_A)^2$ only if 

$u_{ik} = 1$ if $(\|x_k - v_i\|_A)^2 = \min_j \{(\|x_k - v_j\|_A)^2\}$; $u_{ik} = 0$ otherwise    (6a) 

$v_i = \sum_k u_{ik} x_k / \sum_k u_{ik}$    (6b) 


Note that HCM produces hard clusters $U \in M_{cn}$. The HCM conditions are necessary for "minima" of (4) (i.e., 
with A = I, the Euclidean norm on $\mathbb{R}^s$), and, as we shall note, are also used to derive hard clusters in the KSO 
algorithm. The well known generalization of the HCM conditions is contained in the: 

Fuzzy c-Means (FCM) Theorem [2]. (U, v) may minimize $\sum_k \sum_i (u_{ik})^m (\|x_k - v_i\|_A)^2$ for m > 1 only if: 

$u_{ik} = \left( \sum_{j=1}^{c} (\|x_k - v_i\|_A / \|x_k - v_j\|_A)^{2/(m-1)} \right)^{-1}$    (7a) 

$v_i = \sum_k (u_{ik})^m x_k / \sum_k (u_{ik})^m$    (7b) 


The FCM conditions are necessary for minima of (5). There is an alternative equation for (7a) if one or more of 
the denominators in (7a) is zero. These equations converge to the HCM equations as $m \to 1$ from above, and 
for (m > 1), the U in FCM is truly fuzzy, i.e., $U \in (M_{fcn} - M_{cn})$. The FCM algorithms are simple Picard iteration 
through the paired variables U and v. Because we want to compare this method to the KSO algorithm, we give 
a brief description of the FCM/HCM algorithms. 


(Parallel) c-Means 


<FCM/HCM 1>: Given unlabeled data set $X = \{x_1, x_2, \ldots, x_n\}$. Fix: $1 < c < n$; $1 \le m < \infty$ (m = 1 for HCM); 
positive definite weight matrix A to induce an inner product norm on $\mathbb{R}^s$; and $\epsilon$, a small positive constant. 

<FCM/HCM 2>: Guess $v_0 = (v_{1,0}, v_{2,0}, \ldots, v_{c,0}) \in \mathbb{R}^{cs}$ (or, initialize $U_0 \in M_{fcn}$). 

<FCM/HCM 3>: For j = 1 to J: 

<3a>: Calculate $U_j$ with $\{v_{i,j-1}\}$; 

<3b>: Update $\{v_{i,j-1}\}$ to $\{v_{i,j}\}$ with $U_j$; 

<3c>: If $\max_i \{\|v_{i,j-1} - v_{i,j}\|\} \le \epsilon$, then stop and put (U*, v*) = $(U_j, v_j)$; Else: Next j 

This procedure is known to converge q-linearly from any initialization to a local minimum or saddle point (U*, v*) 
of $J_m$. Note again that the update rule for the weights $\{v_i\}$ at step <3b> is a necessary condition for minimizing 
$J_m$. Moreover, all (c) weight vectors are updated using all (n) data points simultaneously at each pass; i.e., the 
weights $\{v_i\}$ are not sequentially updated as each $x_k$ is processed. This is why we call the above description a 
"parallel" version of c-means, as opposed to the well known sequential version. 

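The parallel FCM loop can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than a reference implementation: it takes A = I (Euclidean norm) and m > 1, it handles a zero distance in (7a) by sharing full membership among the coinciding prototypes (one common convention; the text only says that an alternative equation exists), and the function names, the `max_iter` cap and the default tolerance are choices made here.

```python
import math


def sq_dist(x, v):
    """Squared Euclidean distance (the A = I case of the A-norm)."""
    return sum((a - b) ** 2 for a, b in zip(x, v))


def update_memberships(X, V, m):
    """Equation (7a), written on squared distances; requires m > 1."""
    U = [[0.0] * len(X) for _ in V]
    for k, x in enumerate(X):
        d = [sq_dist(x, v) for v in V]
        zeros = [i for i, di in enumerate(d) if di == 0.0]
        if zeros:  # x_k coincides with a prototype: assumed convention
            for i in zeros:
                U[i][k] = 1.0 / len(zeros)
            continue
        for i in range(len(V)):
            # (d_i / d_j)^(1/(m-1)) on squared distances is the 2/(m-1) power on norms
            U[i][k] = 1.0 / sum((d[i] / d[j]) ** (1.0 / (m - 1.0))
                                for j in range(len(V)))
    return U


def update_centers(X, U, m):
    """Equation (7b): means weighted by the m-th power of the memberships."""
    s = len(X[0])
    V = []
    for row in U:
        w = [u ** m for u in row]
        total = sum(w)
        V.append([sum(wk * x[f] for wk, x in zip(w, X)) / total for f in range(s)])
    return V


def fcm(X, V0, m=2.0, eps=1e-6, max_iter=100):
    """Picard iteration <3a>-<3c>, starting from guessed centers V0."""
    V = [list(v) for v in V0]
    for _ in range(max_iter):
        U = update_memberships(X, V, m)      # <3a>
        V_new = update_centers(X, U, m)      # <3b>
        shift = max(math.sqrt(sq_dist(a, b)) for a, b in zip(V, V_new))
        V = V_new
        if shift <= eps:                     # <3c>
            break
    return U, V


X = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
U, V = fcm(X, V0=[[1.0], [4.0]])
```

All c centers are recomputed from all n points on every pass, which is the "parallel" behaviour the text contrasts with the sequential version.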
There is a sequential version of hard c-means (SHCM) that can be used to minimize $J_1$, and readers should be 
aware that it may produce quite different results than HCM on the same data set. One iteration of the SHCM 
algorithm is as follows: beginning with some hard U, the centers $\{v_i\}$ are calculated with (6b). Once the 
prototypes are known, one returns to update U. Beginning with $x_1$, each point is examined, and moved from, 
say, cluster i to cluster j, so as to maximize the decrease in $J_1$ (if possible). Then the two affected centers 
$\{v_i, v_j\}$ and rows i and j of U are updated using equations (6). One complete pass of SHCM consists of testing 
each of the n data points in X, and effecting a transfer at each point where a decrease in $J_1$ can be realized. 
SHCM terminates when a complete pass can be made without transfers. We mention this version of HCM 
because it is SHCM that most closely resembles the KSO algorithm. Figure 3 is a rough depiction of how the 
HCM method might begin; Figure 4 indicates a desirable situation at termination. In Figure 3 the initial hard 
clusters subdivide the data badly, and the overall mean squared error (the sum of squares of the solid line 
distances between data points and prototypes) is large; at termination, the prototypes lie "centered" in their 
clusters, the overall sum of squared errors is low, and the hard 2-partition subdivides the data "correctly" (this 
is what happens if we are lucky!). 
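
A sequential pass of this kind can be sketched as follows. The sketch makes several simplifying assumptions that are not in the text: the hard partition is stored as one label per point instead of a matrix U, $J_1$ is recomputed from scratch for every candidate transfer (clear but slow), a point that is alone in its cluster is never moved so that no cluster becomes empty, and the names `shcm`, `centers` and `j1` are hypothetical.

```python
def centers(X, labels, c):
    """Equation (6b) for a hard partition stored as one label per point."""
    s = len(X[0])
    V = []
    for i in range(c):
        members = [x for x, lab in zip(X, labels) if lab == i]
        V.append([sum(x[f] for x in members) / len(members) for f in range(s)])
    return V


def j1(X, labels, c):
    """Objective (4): total squared Euclidean error about the cluster means."""
    V = centers(X, labels, c)
    return sum(sum((a - b) ** 2 for a, b in zip(x, V[lab]))
               for x, lab in zip(X, labels))


def shcm(X, labels, c, max_passes=100):
    """Sequential hard c-means; every cluster must be non-empty initially."""
    labels = list(labels)
    for _ in range(max_passes):
        transfers = 0
        for k in range(len(X)):
            i = labels[k]
            if labels.count(i) == 1:
                continue  # moving the only member would empty cluster i
            best_j, best_val = i, j1(X, labels, c)
            for j in range(c):
                if j == i:
                    continue
                labels[k] = j                  # try the transfer i -> j
                val = j1(X, labels, c)
                if val < best_val - 1e-12:     # keep the largest decrease in J_1
                    best_j, best_val = j, val
            labels[k] = best_j                 # commit the best move (or stay)
            transfers += best_j != i
        if transfers == 0:                     # a full pass without transfers
            break
    return labels, centers(X, labels, c)


X = [[0.0], [0.2], [0.4], [5.0], [5.2], [5.4]]
final_labels, V = shcm(X, labels=[0, 1, 0, 1, 0, 1], c=2)
```

Because each point is tested in turn and the centers change immediately after a transfer, the result can depend on the order in which the data are presented, which is the property the discussion of KSO returns to.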


Figure 3. An Initial 2-Partition and Prototypes for HCM 

Figure 4. A (Benevolent) Final Configuration of 2-Partition and Prototypes for HCM 



Kohonen's method differs from the c-means approach in several important ways. First, it is not a norm-driven 
scheme. Instead, the KSO method uses the geometric notion of orientation matching, depicted in Figure 5, 
as the basic measure of similarity between data points and cluster centers. Second, there is no partition U 
involved in the KSO algorithm. Instead, an initial set of cluster centers are iteratively updated without reference 
to partitions of the unlabeled data. The underlying geometry of the criterion of similarity is shown in Figure 5. 

The measure of similarity, as shown in Figure 5, is the angle between a data point x and prototype v (in the 
neural network community, the vectors $\{v_i\}$ are often called "weight" vectors, each one being attached or 
identified with a "node" in the network). Information that the data set may contain about cluster shapes in 
feature space is lost (i.e., not used by $\cos(\theta)$); and if the data are normalized at each step to be vectors of 
length 1, as they usually are in the KSO approach, location information is lost as well. Consequently, the 
geometry favored by the KSO criterion of similarity is data substructures that lie in angular cones emanating 
from the origin. We emphasize that in real data, either type of criterion (the c-means type norm driven 
measure, or the KSO angular measure) may or may not be appropriate for matching the data. As with all 
clustering problems, the question is not "which is better?" but "which is better for this data set?" In 
order to effect comparison with the c-means model, a brief description of Kohonen's algorithm follows. 

Figure 5. Geometry of Cluster Formation in Orientation Matching Clustering Algorithms 

$\cos(\theta) = \langle x, v \rangle = 1 - \|x - v\|^2 / 2$ 

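
The identity stated with Figure 5 follows in one line once both x and v have unit length, which is the normalization assumed in the KSO description:

```latex
\|x - v\|^2 = \langle x - v,\, x - v \rangle
            = \|x\|^2 - 2\langle x, v \rangle + \|v\|^2
            = 2 - 2\langle x, v \rangle
\quad\Longrightarrow\quad
\cos(\theta) = \langle x, v \rangle = 1 - \tfrac{1}{2}\|x - v\|^2 .
```

For unit vectors, then, the prototype at the smallest Euclidean distance from x is also the one at the smallest angle, so a minimum-distance search and a maximum-cosine search select the same prototype.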


Kohonen's (KSO) Clustering Algorithm 

<KSO 1>: Given unlabeled, "ordered" data set $X = \{x_1, x_2, \ldots, x_n\}$. Fix: 1 < c*; Choose update scale factors 
$\{\alpha_j\}$ so that $\{\alpha_j\} \to 0$, $\sum_j \alpha_j = \infty$, and $\sum_j (\alpha_j)^2 < \infty$; Choose update neighborhood "radii" $\{\rho_j\} \subset \{0, 1, 2, \ldots, c^*\}$. 

<KSO 2>: Guess (unit vectors) $v_0 = (v_{1,0}, v_{2,0}, \ldots, v_{c^*,0}) \in \mathbb{R}^{c^* s}$. 

<KSO 3>: For j = 1 to J: For k = 1 to n: 

<3a>: Find i*(k) such that $(\|x_k - v_{i^*(k),j-1}\|_I)^2 = \min_i \{(\|x_k - v_{i,j-1}\|_I)^2\}$; 

<3b>: For indices $t \in N^*(k) = \{i^*(k),\ i^*(k) \pm 1,\ \ldots,\ i^*(k) \pm \rho_j\}$, update $v_{t,j-1}$: 

$v_{t,j} = \dfrac{v_{t,j-1} + \alpha_j (x_k - v_{t,j-1})}{\|v_{t,j-1} + \alpha_j (x_k - v_{t,j-1})\|_I}$; otherwise, $v_{t,j} = v_{t,j-1}$    (8) 

Next k; Next j 

We have used c* instead of c in this procedure to emphasize the fact that Kohonen's method often uses 
"multiple" prototypes, in the sense that even though (unbeknownst to us!) X contains only c clusters, it may 
be advantageous to look for c* > c cluster centers; this is a further difference between the c-means and KSO 
strategies. This is one form of Kohonen's approach; other update rules have been used. The geometry of the 
update rule for the weight vectors in (8) is depicted in Figure 6. Thus, if we are at point $x_k$, as shown in Figure 
6, <3a> of the KSO algorithm simply finds the current prototype ($v_{old}$) closest to $x_k$ in angle (minimizing the 
angle is equivalent to the formula in <3a>). If the current center is called $v_{old} = v_{i^*(k)}$ as in Figure 6, then 
update equation (8) connects $v_{old} = v_{i^*(k)}$ to the vector $x_k$, rotates $v_{old}$ to the new position $v_{new}$, and 
finally normalizes $v_{new}$. 
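
The winner search and the rotate-and-normalize update can be sketched for a single data point as follows. This is an illustration only: the function names are hypothetical, the prototypes are assumed to be arranged along a one-dimensional line of indices, and the neighborhood is clipped at the ends of that line, none of which is specified in the text.

```python
import math


def normalize(v):
    """Scale a vector to unit Euclidean length."""
    length = math.sqrt(sum(a * a for a in v))
    return [a / length for a in v]


def kso_step(x, V, alpha, radius):
    """Apply <3a> and <3b> for one data point x; returns the winner index.

    V is a list of unit-length prototypes and is modified in place.
    """
    # <3a>: the winner is the prototype closest to x
    winner = min(range(len(V)),
                 key=lambda i: sum((a - b) ** 2 for a, b in zip(x, V[i])))
    # <3b>: move the winner and its index neighbors toward x, then renormalize (8)
    for t in range(max(0, winner - radius), min(len(V), winner + radius + 1)):
        V[t] = normalize([v + alpha * (a - v) for a, v in zip(x, V[t])])
    return winner


V = [[1.0, 0.0], [0.0, 1.0]]
x = normalize([2.0, 1.0])
winner = kso_step(x, V, alpha=0.5, radius=0)
```

With `radius=0` only the winner moves; it is rotated toward x and stays on the unit circle, while the other prototype is left where it was.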

The KSO procedure is exactly like SHCM in that it updates (some subset of the) prototypes sequentially after 
the examination of each data point. Figure 7 indicates the geometry of the scheme specified in <3b>; the 
basic idea is that once the prototype $v_{old}$ closest to the current data point is found, all prototypes in a 
neighborhood of the "winner" are also updated. 

Figure 6. The Geometry of Kohonen's Updating Rule 

Figure 7. KSO Updating of Prototypes in the Neighborhood N*(k) of "Winner" $v_{i^*(k)}$ 



Although the "feature web" shown in Figure 7 is conceptualized here as being in $\mathbb{R}^s$, it has actually been 
displayed only for the case s = 2. Kohonen has shown that this process converges, in the sense that the 
$\{v_{t,j}\} \to \{v_t^*\}$ as $\{\alpha_j\} \to 0$, in the special case s = 2. Moreover, the limiting $\{v_t^*\}$ preserve a "topological ordering" property 
of the data set X on an array of output nodes associated to the weight vectors. Iteration in the KSO method 
thus trains the weight vectors $\{v_t^*\}$ so that they preserve "order" in the output nodes. As previously noted, 
the KSO method does not use or generate a partition U of the data during training. However, once the weight 
vectors stabilize, the KSO model produces a hard U by following the nearest prototype rule below. 

More specifically, once a set of prototypes $\{v_i\}$ are found by "training" on some data set X (this includes all four 
methods described above: HCM, FCM, SHCM and KSO), they can be used to label any unlabeled data set. 
For any vector $x \in \mathbb{R}^s$, the HCM equation for $u_{ik}$ defines a (piecewise linear) nearest prototype classifier: 


The Nearest Prototype Classifier Decision Rule: Given $\{v_i\}$, compute, non-iteratively, the hard 
c-partition of (any) data X with HCM equation (6a): 

$u_{ik} = 1$ if $(\|x_k - v_i\|_I)^2 = \min_j \{(\|x_k - v_j\|_I)^2\}$; $u_{ik} = 0$ otherwise.    (9) 


Note that we have written (9) with the Euclidean norm. Theorem 2 suggests that any scalar product induced 
A-norm might be used in the formula; however, interpretation of the subsequent decision rule as discussed 
above becomes very difficult. Thus, while it makes sense geometrically to consider variations in the norm as in 
(7) while searching for the cluster centers, it is much less clear that norms other than the Euclidean norm 
should be used during classification. Figure 8 is a rough depiction of how the KSO method might begin; 
Figure 9 shows the situation after termination of KSO, followed by a posteriori application of (9) to find an 
"optimal" hard c-partition U corresponding to the final weights. A question about how rule (9) is used with the 
KSO prototypes remains: how do we, without labeled data, assign one of c < c* "real" labels to subsets of the 
c* weight vectors found by the KSO scheme? The same question applies to FCM - we still need to decide 
which of the c "real" labels belongs to each prototype - the problem is just more pronounced when there are 
multiple prototypes for each class. 
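
Rule (9) is a one-pass labeling and can be sketched directly. The function names below are hypothetical, ties are broken in favor of the lowest prototype index (an assumption; the rule as written does not say how ties are resolved), and the sketch does nothing about the open question raised above of mapping c* prototypes onto c class labels.

```python
def nearest_prototype_labels(X, V):
    """Rule (9): index of the nearest prototype (Euclidean norm) for each x_k."""
    def sq_dist(x, v):
        return sum((a - b) ** 2 for a, b in zip(x, v))
    return [min(range(len(V)), key=lambda i: sq_dist(x, V[i])) for x in X]


def labels_to_partition(labels, c):
    """Arrange the labels as a hard c x n matrix U with entries 0 or 1."""
    return [[1 if lab == i else 0 for lab in labels] for i in range(c)]


V = [[0.0, 0.0], [4.0, 4.0]]                 # prototypes from any of the four methods
X = [[0.5, 0.2], [3.0, 3.5], [1.0, 1.0]]     # any unlabeled data
labels = nearest_prototype_labels(X, V)
U = labels_to_partition(labels, c=len(V))
```

The prototypes may come from HCM, FCM, SHCM or KSO; the labeling step itself is the same in every case.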


Figure 8. Initial Configuration of Weight Vectors in the KSO Scheme 

Figure 9. Terminal Weight Vectors and an HCM Partition in the KSO Scheme 



4. DISCUSSION AND CONCLUSIONS 

First, we itemize the major differences between ISODATA and KSO : 

(D1) FCM, HCM and SHCM are intrinsic clustering methods, i.e., one of their inputs is an unknown partition, 
and one of their outputs is a partition of unlabeled data set X which is optimal in the sense of minimizing a 
norm driven objective function. The KSO method, on the other hand, needs an a posteriori rule such as 
the nearest prototype rule at (9) to generate a partition of the data non-iteratively. We might call this an 
extrinsic clustering scheme. Moreover, without labeled data that can be used to discover which subsets 
of the c* multiple prototypes found by the KSO scheme should be identified with each of the c classes 
assumed in (9), there is no general way to even implement (9) with the KSO rule. Thus, much must be 
added to KSO to make it a true clustering method. 

(D2) The data set X is used differently. KSO uses the data sequentially (locally), and hence its outputs are
dependent on local geometry and order of labels, whereas ISODATA utilizes the data globally, and
updates both the weights and partition values in parallel at each pass. In this sense KSO is most akin to
Sequential Hard c-Means, which is also sensitive to ordering of labels - this is often regarded as a fatal flaw
in clustering.

(D3) KSO can have multiple prototypes for each class; ISODATA has but one. In clustering, the usual 
assumption is that c is unknown, and one resorts to various cluster validity schemes to validate the results 
of any algorithm. Since the KSO scheme uses many prototypes, without assuming an underlying "true
but unknown" number of clusters, this is advantageous to the user. However, the dilemma of how to
convert the prototypes into clusters, as discussed in (D1), persists. 

(D4) KSO uses local orientation (cos θ = <x,v>) on the unit ball as the measure of similarity between data and
weights, whereas ISODATA uses cluster shape (via the eigenstructure of A) and location (via the lengths
of the weights and the data) to assess (dis)similarity between the data and prototypes. Thus, the c-Means 
approach has a much more "statistical" flavor than KSO. On the other hand, KSO uses the dot product at 
each node, in the spirit of the McCulloch-Pitts neuron. Thus, local computations in the KSO scheme 
proceed on the basis assumed by many workers in neural network research, and make the KSO scheme 
more easily identifiable with this type of computational architecture. 

(D5) KSO preserves "order" in a certain sense; ISODATA does not. This property of the KSO method is 
perhaps its most interesting distinction. There is little hope that c-Means has a similar property. Since 
cognitive science assures us that one aspect of intelligence is its inherent ability to order, this aspect of 
the KSO approach again shows well in its favor. A significant line of research concerns whether or not the 
FCM/HCM models possess this, or any similar property. 


(D6) Weight updates in the KSO method are intuitively appealing; weight updates in ISODATA are
mathematically necessary. Since the update formula in c-Means finds either real or generalized centroids,
we might claim that this scheme is also intuitively appealing. In this regard the c-Means algorithms
(including SHCM) have a clear theoretical advantage, at least in terms of justification of the procedure
used.


(D7) FCM, HCM and SHCM are all well-defined optimization problems; KSO is a heuristic procedure. An
interesting question about KSO is this: what function is being optimized during iteration? An answer to
this question would be both useful and illuminating. The criterion functions that drive FCM, HCM and
SHCM are well understood geometrically and statistically; discovery of a criterion function for Kohonen's
algorithm might supply a great deal of insight about other properties of the algorithm and its outputs.

(D8) KSO partitions have so far been generated with the nearest prototype rule and the Euclidean norm,
whereas FCM, HCM and SHCM can be used with any inner product and two Minkowski norms. Much
research can be done on the issue of how best to use the Kohonen prototypes to find cluster
substructure. There are many natural ways besides the nearest prototype rule to use KSO outputs with
the weights {v_j}. For example, one could simply distribute unit memberships satisfying (1b) across the
KSO nodes at each step using distance proportions. This generalizes Kohonen's model from a
"neighborhood take all" to a "neighborhood share all" concept. One certainly suspects that it is possible
to incorporate U ∈ M_fcn as an unknown in the KSO approach, so that an extended KSO algorithm
creates partitions of the data that are necessary, rather than, as in the current use of the HCM labeling
rule, a heuristic afterthought.
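
The two labeling options contrasted in (D8) can be sketched in Python as follows. This is only an illustration under stated assumptions: the inverse-distance weighting is one possible reading of "distance proportions", the memberships of each datum are taken to sum to one, and the function names and toy prototypes are hypothetical.

```python
import math

def nearest_prototype(x, prototypes):
    """Hard label: index of the prototype closest to x in the Euclidean norm."""
    distances = [math.dist(x, v) for v in prototypes]
    return distances.index(min(distances))

def shared_memberships(x, prototypes):
    """Soft labels: memberships proportional to inverse distance, summing to 1.

    Inverse-distance weighting is an assumed choice of "distance proportions".
    """
    distances = [math.dist(x, v) for v in prototypes]
    if 0.0 in distances:
        # x coincides with a prototype: give it full membership there.
        hit = distances.index(0.0)
        return [1.0 if j == hit else 0.0 for j in range(len(prototypes))]
    inverse = [1.0 / d for d in distances]
    total = sum(inverse)
    return [w / total for w in inverse]

prototypes = [(0.0, 0.0), (4.0, 0.0)]     # toy weight vectors
x = (1.0, 0.0)                            # toy datum
print(nearest_prototype(x, prototypes))   # "take all": a single winning node
print(shared_memberships(x, prototypes))  # "share all": one membership per node
```

The hard rule discards how close the datum is to the losing prototypes, whereas the shared version keeps that information in the membership values.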

Major similarities between ISODATA and KSO include: 

(S1) If we let (U_F, v_F), (U_H, v_H), (U_S, v_S), and (U_K, v_K) denote, respectively, the pairs found by FCM, HCM,
SHCM and KSO, we note that (U_F, v_F) is a critical point for J_m, while (U_H, v_H), (U_S, v_S), and (U_K, v_K)
are, because of the HCM theorem, (possibly different) critical points of J_1. However, (U_H, v_H) ≠ (U_S, v_S)
≠ (U_K, v_K) generally. This suggests that (i) HCM (and especially SHCM) and KSO as described herein are
most definitely related, and (ii) there should be a generalized (fuzzy) KSO that bears the same
relationship to FCM that the hard c-Means versions bear to the current version of KSO. It seems clear
that there is a stronger mathematical link between FCM/HCM and KSO than is currently known.
Connection of the two approaches begins with careful formulation of a constrained optimization problem
that holds for KSO. This involves finding a global KSO criterion function and necessary conditions that
require the calculation of the weight vectors {v_j} as in KSO (3b).

(S2) Both algorithms find prototypes (weights or cluster centers) in the data that provide a compressed
representation of it, and enable nearest prototype classifier design. Recent work by Huntsberger and
Ajjimarangsee [13] indicates that FCM is at least as good as KSO in terms of minimizing apparent error
rates. Further, FCM sometimes generates identical solutions to KSO on various well known data
sets. This is another powerful indicator of the underlying (unknown) relationship between the KSO and
c-Means methods. Much can be done empirically to confirm or deny specific relationships between the
two methods.

We have itemized some similarities and differences between two approaches to the clustering of unlabeled
data - Hard/Fuzzy c-Means and Kohonen's self-organizing feature maps (KSO) - and posed some questions
concerning each method. Successful resolution of these questions will benefit both models. Numerical
convergence properties and the neural-like behavior of both the extended KSO and FCM algorithms should
be established. Issues to be studied should include: robustness, adaptivity, parallelism, apparent error rates,
time and space complexity, type and rate of convergence, optimality tests, and initialization sensitivity.


Abstract

The VLSI implementation of a fuzzy logic inference mechanism allows the use of rule-based
control and decision making in demanding real-time applications such as robot control and in the
area of command and control. We have designed a full custom VLSI inference engine. The chip is
fabricated using 1.0 µm CMOS technology. The chip consists of 688,000 transistors of which 476,000
are used for RAM memory.

The fuzzy logic inference engine board system incorporates the custom designed integrated circuit
into a standard VMEbus environment. The Fuzzy Logic system board uses TTL logic parts
to provide the interface between the Fuzzy chip and a standard, double height VMEbus backplane
allowing the chip to perform application process control through the VMEbus host. High level C
language functions hide details of the hardware system interface from the applications level programmer.
The first version of the board was installed on a robot at Oak Ridge National Laboratory
in January of 1990.


1 Introduction 

Fuzzy logic based control uses a rule-based expert system paradigm in the area of real-time process 
control [4]. It has been used successfully in numerous areas including train control [12], cement kiln 
control [2], robot navigation [6], and auto-focus camera [5]. In order to use this paradigm of a fuzzy 
rule-based controller in demanding real-time applications, the VLSI implementation of the inference 
mechanism has been an active research topic [1,11]. Potential applications of such a VLSI inference 
processor include real-time decision-making in the area of command and control [3], and control of

An original prototype experimental chip designed at AT&T Bell Labs [7] was the precursor to the 
fuzzy logic inference engine IC that is the heart of our hardware system. The current chip was designed 
at the University of North Carolina in cooperation with engineers at the Microelectronics Center of 
North Carolina (MCNC) [8]. MCNC fabricated and tested fully functional chips. 

The new architecture of the inference processor has the following important improvements compared
to previous work:

1. programmable rule set memory 

2. on-chip fuzzifying operation by table lookup 

3. on-chip defuzzifying operation by centroid algorithm 

4. reconfigurable architecture 

5. RAM redundancy for higher yield 

The fuzzy chips are now incorporated in VMEbus circuit boards. One of the boards was designed
for NASA Ames Research Center and another board was designed for Oak Ridge National Laboratory
(ORNL). The latter board has been installed and is currently performing navigational tasks on
experimental autonomous robots [9].

ORNL will soon receive the second version of the board system featuring seven Fuzzy chips in a 
software reconfigurable interconnection network. The network provides host and inter-chip I/O in any 
logical configuration of the seven chips. 

2 Fuzzy Inference 

The inference mechanism implemented is based on the compositional rule of inference for approximate
reasoning proposed by Zadeh [13]. Suppose we have two rules with two fuzzy clauses in the IF-part and
one clause in the THEN-part:

Rule 1: If (x is $A_1$) and (y is $B_1$) then (z is $C_1$),

Rule 2: If (x is $A_2$) and (y is $B_2$) then (z is $C_2$).

We can combine the inference of the multiple rules by assuming the rules are connected by the OR
connective, that is, Rule 1 OR Rule 2 [7]. Given the fuzzy propositions (x is $A'$) and (y is $B'$), the weights
$\alpha_i^A$ and $\alpha_i^B$ of the clauses of the premises are calculated by:

$$\alpha_i^A = \max_x \min\bigl(A'(x), A_i(x)\bigr),$$

$$\alpha_i^B = \max_y \min\bigl(B'(y), B_i(y)\bigr), \quad \text{for } i = 1, 2.$$

Then, the weights $w_1$ and $w_2$ of the premises are calculated by:

$$w_1 = \min(\alpha_1^A, \alpha_1^B),$$

$$w_2 = \min(\alpha_2^A, \alpha_2^B).$$

Weight $\alpha_i^A$ represents the closeness of proposition (x is $A_i$) and proposition (x is $A'$). Weight $w_i$
represents a similar measure for the entire premise of the i-th rule. The conclusion of each rule is

$$C_i' = \min(w_i, C_i), \quad \text{for } i = 1, 2.$$

The overall conclusion $C'$ is obtained by

$$C' = \max(C_1', C_2').$$

This inference process is shown in Figure 1. In this example, $\alpha_1^A = 0.5$ and $\alpha_1^B = 0.25$, therefore
$w_1 = 0.25$; $\alpha_2^A = 0.85$ and $\alpha_2^B = 0.5$, therefore $w_2 = 0.5$.
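
The max-min inference steps of this section can be sketched in Python on discretized membership functions. This is a software illustration only, not the chip's implementation: the three-point universes, the membership values, and the function names are assumptions, chosen so that the clause weights match the example values quoted above.

```python
def clause_weight(observed, stored):
    """Closeness of two fuzzy sets: max over the universe of the pointwise min."""
    return max(min(a, b) for a, b in zip(observed, stored))

def infer(rules, a_observed, b_observed):
    """Combine rules of the form (A_i, B_i, C_i) with the OR connective (max)."""
    conclusion = None
    for a_i, b_i, c_i in rules:
        # Weight of the whole premise: min of the two clause weights.
        w = min(clause_weight(a_observed, a_i), clause_weight(b_observed, b_i))
        # Conclusion of this rule: C_i clipped at the premise weight.
        clipped = [min(w, c) for c in c_i]
        if conclusion is None:
            conclusion = clipped
        else:
            conclusion = [max(p, q) for p, q in zip(conclusion, clipped)]
    return conclusion

# Toy membership functions on three-point universes (illustrative values only).
rules = [
    ([1.0, 0.5, 0.0], [1.0, 0.25, 0.0], [1.0, 0.5, 0.0]),   # Rule 1: A1, B1, C1
    ([0.0, 0.85, 1.0], [0.0, 0.5, 1.0], [0.0, 0.5, 1.0]),   # Rule 2: A2, B2, C2
]
a_observed = [0.0, 1.0, 0.0]   # A': a crisp reading at the middle point
b_observed = [0.0, 1.0, 0.0]   # B'
print(infer(rules, a_observed, b_observed))
```

With these values the premise weights are 0.25 and 0.5, so the second rule dominates the combined conclusion wherever $C_2$ is large.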

3 Fuzzy Chip

The fuzzy logic inference engine is a fully custom designed 1.0 micron CMOS VLSI circuit of 688,000 
transistors implementing a fuzzy logic based rule system. Included on chip are a programmable rule set 
memory, an optional input fuzzifying operation by table lookup, a minimax paradigm fuzzy inference 
processor, and an optional output defuzzifying operation using a centroid algorithm. The standard data 
path configuration is shown in Figure 2. The design has a reconfigurable architecture implementing either 
50 rules with 4 inputs and 2 outputs, or 100 rules with 2 inputs and 1 output. Separately addressed 
status registers allow programmed control of the fuzzy inference processing and chip configuration. All 
the rules operate in parallel generating new outputs over 150,000 times per second. 

The chip has 12 bidirectional data pins and 7 address pins for rule memory I/O. For process-control
I/O, each of the 4 inputs and 2 outputs has 6 pins. Each of the 4 inputs has a corresponding load pin. The
chip also has several control signals. The control signals RW (read high, write low) and CEN (chip enable)
are similar to those of a memory chip.

4 The System Boards 

4.1 Single Chip Systems 

The Fuzzy Logic system boards place the Fuzzy chip into a VMEbus environment to provide application 
process control through a VMEbus host. The single chip system designed for NASA Ames Research 
Center uses an off-the-shelf VMEbus prototyping board [10]. The overall configuration of the design is
shown in Figure 3. In this design, the VMEbus interface is provided by the prototyping board system 
and needed a minimum of design for integration of the fuzzy chip. The fuzzy chip interface to the board 
is realized using discrete TTL parts and wire-wrapping. In the board system for ORNL, the VMEbus 
interface was designed by the first author and realized using a programmable logic device (PLD) and 
TTL parts. More robust printed circuit board (PCB) technology was used. The PCB architectural 
concept is shown in Figure 4. The UNIX device driver interfaces of these two boards are quite similar. 

The ORNL board is designed to standard VMEbus specifications for a 24 bit address, 16 bit data,
slave module as found in The VMEbus Specification, Revision C.1, 1985. It provides digital communication
between the host and the Fuzzy chip. A large, UV erasable PLD generates the board control
signals. The VMEbus interface is through TTL parts. One Fuzzy Inference IC processes four 6-bit inputs to
generate two 6-bit outputs. The interface with the host computer uses memory mapping to include the
Fuzzy chip's I/O addresses in the application process storage space. All of the chip's memory as well as
its inputs and outputs are accessed through addresses on the VMEbus so that the entire Fuzzy Logic
board system responds like a section of memory.

The board's address space is 1024 bytes or 512 16-bit words in length. Most of the addresses in
that space are not used by the board. The lower 128 word addresses of the board are mapped into
the fuzzy chip. One hundred addresses are for rule memory. Another six addresses are mapped to four
fuzzification tables and two status registers. The board has six addresses for I/O for the fuzzy chip, and
addresses for hardware reset and board ID. On-board dip switches and signal jumpers allow the user
to select the board base address, comprised of the upper 14 bits of the 24 bit address, and the board's
user privilege response characteristic, determined by the VMEbus address modifier bits. Further design
details are shown in Figure 5.
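
The address arithmetic above can be checked with a short Python sketch: the upper 14 bits of a 24-bit address select the board, which leaves 10 bits, or 1024 bytes, of board address space. The function names are hypothetical, and the sketch assumes that word addresses are byte offsets divided by two. The positions of the individual registers inside the chip region are not given in the text and are not modeled.

```python
ADDRESS_BITS = 24                          # VMEbus address width used by the board
BASE_BITS = 14                             # upper bits set by the dip switches
OFFSET_BITS = ADDRESS_BITS - BASE_BITS     # 10 bits, i.e. 1024 bytes per board
CHIP_WORDS = 128                           # lower word addresses mapped to the chip

def split_address(address):
    """Split a 24-bit address into (board base field, byte offset in the board)."""
    if not 0 <= address < (1 << ADDRESS_BITS):
        raise ValueError("address does not fit in 24 bits")
    return address >> OFFSET_BITS, address & ((1 << OFFSET_BITS) - 1)

def maps_to_fuzzy_chip(byte_offset):
    """True if the 16-bit word containing this byte offset lies in the chip region."""
    return (byte_offset >> 1) < CHIP_WORDS

base, offset = split_address(0xABC123)
print(hex(base), hex(offset), maps_to_fuzzy_chip(offset))
```

The board space size follows directly from the split: 24 address bits minus 14 base bits leaves 10 offset bits.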


Figure 5: Details of PCB Architecture


4.2 Multiple Fuzzy Chip System

The second version of the system board keeps the standard VMEbus interface of the first version but
adds significant new capabilities. Seven Fuzzy chips communicate with each other and the host through
a software reconfigurable interconnection network. Two Texas Instruments digital crossbar switch ICs
implement the network. Any logical configuration of the seven chips may be specified in software, e.g.
seven in parallel, 4-2-1 binary tree, etc. Any fuzzy output may be routed to any input. With the new
board more inputs may be processed and hierarchies of rule sets may be explored. We can simulate
rules with up to 16 conditions in the IF-part by using three layers of Fuzzy chips. Another application
is to load multiple rule sets for different tasks in a single board. This is done by configuring multiple
chips in parallel. The new printed circuit board architectural concept is shown in Figure 6.
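
The figure of 16 conditions is consistent with a 4-2-1 tree of chips in the 4-input, 2-output configuration, as the following Python sketch shows. It assumes, for illustration, that every output of one layer feeds one input of the next layer. The function name is hypothetical.

```python
INPUTS_PER_CHIP = 4    # chip configuration with 4 inputs
OUTPUTS_PER_CHIP = 2   # and 2 outputs

def tree_capacity(chips_per_layer):
    """Return (total chips, external inputs) for a layered chip arrangement.

    Raises ValueError if a layer produces more outputs than the next layer
    has inputs to accept.
    """
    for upstream, downstream in zip(chips_per_layer, chips_per_layer[1:]):
        if upstream * OUTPUTS_PER_CHIP > downstream * INPUTS_PER_CHIP:
            raise ValueError("next layer has too few inputs")
    return sum(chips_per_layer), chips_per_layer[0] * INPUTS_PER_CHIP

print(tree_capacity([4, 2, 1]))   # three layers, seven chips in total
```

In the 4-2-1 tree the first layer's eight outputs fill the second layer's eight inputs, and the second layer's four outputs fill the four inputs of the final chip.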

This arrangement exploits an important feature of the Fuzzy chip. Normal input to the chip is by
6-bit integers which the chip fuzzifies into 64-value membership functions to be fed into the processing
pipeline. The final output membership function is defuzzified into a 6-bit output integer. However, the
chip has another mode of operation. Any input or output can bypass the [de]fuzzification process so that
I/O occurs in streaming mode. The full 64-value input or output membership function is placed on the
pins, one value per clock cycle. When an output of one chip is connected to an input of another chip (or
itself), communication can be done in streaming mode without the loss of information inherent in the
[de]fuzzification operations. On this system board, all inter-chip communication is done in streaming
mode.

The new board also has four 64-value FIFO queues which allow final output to the host to be done
in streaming mode. The application process is then free to perform its own custom operations on the
full output membership functions. The final defuzzification is no longer limited to a centroid method.
One can also generate the result in higher precision than 6 bits if necessary.
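
As an example of host-side processing of a streamed output, the following Python sketch computes a centroid over a 64-value membership function in floating point. It assumes that element i of the streamed function corresponds to output value i, and the function name is hypothetical. This is not the chip's on-board algorithm, only an illustration of obtaining more than 6 bits of precision on the host.

```python
def centroid(membership):
    """Centroid of a discretized membership function, or None if it is all zero."""
    total = sum(membership)
    if total == 0:
        return None
    return sum(i * m for i, m in enumerate(membership)) / total

# Toy 64-value output membership function (illustrative values only).
mu = [0] * 64
mu[10] = 3
mu[11] = 1
print(centroid(mu))   # a fractional result that a 6-bit integer cannot represent
```

Any other defuzzification rule, such as taking the position of the maximum, could be substituted for `centroid` in the same place.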

Figure 6: Seven Chip System Architecture.


The new board will be installed at ORNL in August, 1990. In addition to navigational tasks the 
system will be used to explore fuzzy logic control of manipulator arm functions. 


5 Software Interface 

High level C language functions can hide the operational details of the board from the applications
programmer. The programmer treats rule memories and fuzzification function memories as local program
structures passed as parameters to the C functions. Similarly, local input variables pass values to the
system and outputs return in local variable function parameters. Programmers are only required to
know the library procedures. Some procedures provided for the version 1 board are described below.

1. WriteRule(rulenum, ruledata) - The rule data structure pointed to by ruledata is written to the
board.

2. ReadRule(rulenum, ruledata) - Reads back into ruledata the rule identified by rulenum currently
stored in the chip.

3. WriteFuzz(fuzznum, fuzzdata) - The fuzzification table is written to the board.

4. StartFZIAC(inpA, inpB, inpC, inpD) - Four inputs are sent to the fuzzy board and inference
processing is started.

5. ReadOut(outE, outF) - Both outputs are read from the board. The inference process continues.

6. StopFZIAC(outE, outF) - Both outputs are read from the board. The inference process is halted.

6 Summary 

We have described the architecture and associated high level software of two VME bus board systems 
based on a VLSI fuzzy logic chip. In addition to operating in the robot at ORNL, the single chip 
board is installed on a Sun-3 workstation at the University of North Carolina for further research and 
software development. For example, it is useful to provide an X-window based user interface to this 
fuzzy inference board. The complex and flexible architecture of the multiple chip board will require more 
sophisticated support software to facilitate exploration of various hierarchical interconnection schemes. 

Whereas conventional fuzzy reasoning is associated with tuning problems, namely the lack of
design methods for membership functions and inference rules, a neural network driven fuzzy
reasoning (NDF) capable of determining membership functions by a neural network is
formulated. In the antecedent parts of the neural network driven fuzzy reasoning, the
optimum membership function is determined by a neural network, while in the consequent
parts the amount of control for each rule is determined by several other neural networks.
Using the algorithm of neural network driven fuzzy reasoning, inference rules for
making a pendulum stand up from its lowest suspended point are determined in order to verify
the usefulness of the algorithm.


1. INTRODUCTION 

Fuzzy reasoning has been applied extensively to various control problems, and a
number of actual examples of fuzzy control have been reported lately [1]. However,
fuzzy reasoning generally involves a tuning problem [2]: the form of the fuzzy
numbers, and the fuzzy variables of the antecedent parts and consequent parts of the
fuzzy inference rules, have to be adjusted to minimize the difference between the
estimate produced by fuzzy reasoning and the output data for given input data.

As a method to solve the tuning problem, a neural network driven fuzzy
reasoning (NDF) [3, 4], in which inference rules are constructed using the learning
function of neural networks [5, 6], was previously reported. The NDF is a type of
fuzzy reasoning in which an error back-propagation type network [7] represents the
fuzzy sets of the antecedent parts, while another error back-propagation type
network represents the input-output relationship of the consequent part of each rule.


In this paper, an algorithm for constructing inference rules based on NDF is
introduced first, and its effectiveness is then verified experimentally using an
inverted pendulum system as an example.

In this experiment, a pendulum starting from its hanging position is reliably swung
up and held at the inverted position by a mechanism controlled by inference rules.
The rules are constructed with the NDF algorithm, which determines the fuzzy sets
from observations of the pendulum operator.

The inference period required for controlling the swing-up motion of the pendulum
is approximately 15 msec. The length of the pendulum is considered here as a
parameter that governs the dynamic characteristics of the inverted pendulum system,
and the changes in the control characteristics of NDF caused by it are studied. Since
NDF determines the fuzzy sets of the antecedent parts and the input-output
relationships of the consequent parts from the input-output data by means of the
learning function of neural networks, without fine-tuning of the inference rules, it
is an advantageous method for solving the tuning problems of fuzzy reasoning.

2. ARTIFICIAL NEURAL NETWORK DRIVEN FUZZY REASONING (NDF)

The NN-driven fuzzy reasoning (NDF) is a fuzzy reasoning [8] using linear
functions in its consequent parts. In NDF, the membership functions of the antecedent
parts are determined in a multi-dimensional space. For example, consider the
following rules R1, R2, and R3 of conventional fuzzy reasoning, where $x_1$ and $x_2$
are input variables, $y_1$, $y_2$, and $y_3$ are output variables, $a_{10}$, $a_{11}$, and so on
are coefficients, and $F_{SL}$ and $F_{BG}$ are fuzzy numbers, where SL and BG mean small
and big respectively.

R1: IF $x_1$ is $F_{SL}$ and $x_2$ is $F_{SL}$, THEN $y_1 = a_{10} + a_{11}x_{11} + a_{12}x_{12}$

R2: IF $x_1$ is $F_{SL}$ and $x_2$ is $F_{BG}$, THEN $y_2 = a_{20} + a_{21}x_{21} + a_{22}x_{22}$    (1)

R3: IF $x_1$ is $F_{BG}$, THEN $y_3 = a_{30} + a_{31}x_{31}$

Since the antecedent part of R1 states that $x_1$ is small and $x_2$ is small, the
fuzzy set $F_1 = F_{SL} \times F_{SL}$ can be constructed in a partial space of the input
space as shown in Fig. 1. The fuzzy sets for R2 and R3 are constructed likewise.
Since the boundary between the partial spaces is vague, the boundary is shown by
hatched lines. That is, the input space spanned by $x_1$ and $x_2$ is divided into as
many partial spaces as there are fuzzy rules, and the fuzzy set of the antecedent
part of each inference rule is constructed in its own partial space. In NDF, these
fuzzy sets of the antecedent parts are determined by a back-propagation type network.

The back-propagation type network is explained as follows. Since neural
networks are constrained to a general type of processing unit found in the nervous
system, and the processing unit in a neural network shares some of the physical
properties of real neurons, the processing unit is called a neuron here.

Fig. 2 shows an example of a fundamental layered back-propagation type
network containing M layers, where the first layer is called the input layer, the
M-th layer is the output layer, and the other layers are called intermediate layers.
Every neuron within these layers represents the relation between its multiple inputs
$x_{ij}$ and its output $y_i$ expressed by the following equations.

$$y_i = f\Bigl(\sum_{j=1} \omega_{ij} x_{ij} + \theta\Bigr) \qquad (2)$$

$$f(z) = \frac{1}{1 + \exp(-z)} \qquad (3)$$

where $\omega_{ij}$ is a weight showing the correlation between neurons and $\theta$ is a
threshold (bias) term.

In this paper, for a given input $x = (x_1, x_2, \ldots, x_n)$ and output
$y = (y_1, y_2, \ldots, y_w)$, the input-output relation of the back-propagation type
network as a whole is expressed by

$$y = NN(x) \qquad (4)$$

The structure of the model function NN(x) is characterized as M layers
$[u_1 \times u_2 \times \cdots \times u_M]$, where $u_i$, $i = 1, 2, \ldots, M$, are the numbers of
neurons within the input, hidden and output layers respectively. Fig. 2 shows the
structure of a four-layer back-propagation type network $[3 \times 2 \times 2 \times 2]$.
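
The forward computation y = NN(x) of a layered network built from the sigmoid units of Eqs. (2) and (3) can be sketched in Python as follows. The sketch covers only the forward pass, not back-propagation learning. It assumes that the input layer passes its values through unchanged and that every neuron has its own bias. The function names and the all-zero weights are illustrative.

```python
import math

def sigmoid(z):
    """Eq. (3): f(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, layers):
    """Propagate input x through a list of (weights, biases) pairs.

    weights[i][j] connects input j of a layer to its neuron i, as in Eq. (2).
    """
    activations = list(x)
    for weights, biases in layers:
        activations = [
            sigmoid(sum(w * a for w, a in zip(row, activations)) + theta)
            for row, theta in zip(weights, biases)
        ]
    return activations

def zero_layer(n_out, n_in):
    """A layer with all weights and biases set to zero (placeholder values)."""
    return [[0.0] * n_in for _ in range(n_out)], [0.0] * n_out

# A four-layer [3 x 2 x 2 x 2] structure: three weight layers after the input.
network = [zero_layer(2, 3), zero_layer(2, 2), zero_layer(2, 2)]
print(forward([0.2, 0.15, 0.7], network))
```

With all weights at zero every unit outputs f(0) = 0.5, which makes the structure easy to check before real weights are learned.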

The fundamental idea of NDF for the consequent parts is that the model
equations $y_1$, $y_2$, and $y_3$ are identified as the nonlinear Eq. (4).

The fundamental idea for the membership functions in the antecedent parts is
the method shown in Fig. 3.

Consider the relationship between the rules R1, R2, R3 and the input data
$(x_{i1}, x_{i2})$, $i = 1, 2, \ldots, N$. The first data point $x_1$ is
$(x_{11}, x_{12}) = (0.2, 0.15)$, and it belongs to rule R1. Thus the attribution of this
data point to the rules can be expressed by (R1, R2, R3) = (1, 0, 0). A three-layer
back-propagation type network [2 × 3 × 2], whose input and output layers are
$(x_{i1}, x_{i2})$ and (R1, R2, R3) respectively, can be derived from the input-output
data used in the learning process. The number of learning iterations is limited to
at most about 1000.
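
The attribution vectors used as training targets can be generated as in the following Python sketch. The function name and the sample rule assignments are hypothetical. The sketch only builds the targets and does not train a network.

```python
def attribution_targets(rule_of_sample, n_rules):
    """Return one target vector per sample: 1 for its rule, 0 for the others.

    rule_of_sample[i] is the 1-based index of the rule that sample i belongs to.
    """
    return [
        [1 if rule == s else 0 for s in range(1, n_rules + 1)]
        for rule in rule_of_sample
    ]

# Three toy samples belonging to R1, R3 and R2 respectively.
print(attribution_targets([1, 3, 2], 3))
```

The first target, (1, 0, 0), corresponds to the attribution (R1, R2, R3) = (1, 0, 0) of the data point quoted above.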

When data different from the learning data are given to the neural network,
the estimated values of the back-propagation type network are regarded as the
membership values of the fuzzy sets in the antecedent parts, since each estimated
value represents the attribution of the data to a rule. A rule division performed
by NDF is typified in Fig. 4, which shows nonlinear divisions unlike the rectangular
divisions shown in Fig. 1.

Pao proposed a method for determining fuzzy sets by using a neural network [9],
and obtained intersections and unions of fuzzy sets. However, what he determined
were the intersections and unions of fuzzy sets from the coupling patterns between
the units of the neural network; his method does not determine the shape of the
fuzzy sets from input-output data, as NDF does.

In NDF, the control rules are represented in the IF-THEN format shown below.

$$R_s: \text{IF } x = (x_1, x_2, \ldots, x_n) \text{ belongs to } A_s, \text{ THEN } y_s = NN_s(x_1, x_2, \ldots, x_m),$$

$$\text{where } s = 1, 2, \ldots, r, \quad m \le n \qquad (5)$$

The number of inference rules is r, and $A_s$ represents a fuzzy set in the input
space of the antecedent parts. The degree to which the input
$x = (x_1, x_2, \ldots, x_n)$ belongs to the s-th inference rule is defined as the
membership value of x in the fuzzy set $A_s$. Furthermore, the control amount $y_s$ of
the consequent part is the estimated value obtained when the combination of input
variables $(x_1, x_2, \ldots, x_m)$ is applied to the input layer of the back-propagation
type network; the number of variables m used here is determined by a method for
selecting the optimum model of the back-propagation type network.

Although it is also possible to determine the overall nonlinear relationship
with only one back-propagation type network, determining the overall input-output
relationship by applying a back-propagation type network to each partial space is
considered more advantageous, because it clarifies the overall nonlinear
relationship better.

For the optimum model selection of the back-propagation type networks of the
consequent parts, a stepwise method [10] is available, in which specified input
variables are introduced into and removed from the combination of input variables
to obtain a model that outputs an optimum estimated value.

In the present work, only the elimination of input variables from the
combination of input variables is performed, using the back-propagation type
network, to derive an optimum combination of input variables and model formula.
The sum of squared residuals is employed for evaluating and determining the
input variables.

The algorithm of NDF is explained below with reference to the block diagram of
NDF shown in Fig. 5. The stepwise procedure for obtaining the inference rules and
the control value $y_i^*$ for the input data $x_i$ is as follows.

Step 1: Selection of the input variables $x_1, x_2, \ldots, x_n$ that are related to the
control value y. It is assumed that the input-output data
$(y_i, x_i) = (y_i, x_{i1}, x_{i2}, \ldots, x_{in})$, $i = 1, 2, \ldots, N$, are obtained, where the
input data $x_{ij}$, $j = 1, 2, \ldots, n$, is the i-th data of input variable $x_j$.

Step 2: Division of the input-output data into r classes $R_s$, $s = 1, 2, \ldots, r$. As
mentioned before, each partition is regarded as an inference rule $R_s$, and the
input-output data for each $R_s$ are expressed by $(y_i(s), x_i(s))$,
$i = 1, 2, \ldots, N_s$, where $N_s$ is the number of input-output data for each $R_s$.

Step 3: Decision of the membership functions in the antecedent parts by using the
neural network $NN_{mem}$ shown in Fig. 5, where the structure of the back-propagation
type network is M-layered $[n \times u_2 \times \cdots \times u_{M-1} \times r]$. The method for
determining the form of the membership functions was described previously.

Step 4: Decision of the control models in the consequent parts by using the neural
networks $NN_1, NN_2, \ldots, NN_r$ shown in Fig. 5, where the structure of each
back-propagation type network $NN_s$ is M-layered $[k \times u_2 \times \cdots \times u_{M-1} \times 1]$
with $k = n, n-1, \ldots, 1$, and the optimum model for each $NN_s$ is selected.

The stepwise procedure for determining the input variables by using the
back-propagation type network, and the method for determining the structure of the
consequent parts, are described in the following.

Setting k = n, the input variables $x_i = (x_{i1}, x_{i2}, \ldots, x_{ik})$,
$i = 1, 2, \ldots, N$, are assigned to the input layer of each $NN_s$, and the output
variable $y_i$ is assigned to the output layer of each $NN_s$. The sets of variables
assigned to the input layer and to the output layer are respectively expressed by:

$$A_s = \{x_1, x_2, \ldots, x_k\} \qquad (6)$$

$$\{y\} \qquad (7)$$

where (6) is the set of input variables assigned to the input layer of each
back-propagation type network $NN_s$, and (7) is the set of output variables assigned
to the output layer of $NN_s$.

An estimate $ey_i$ for the input data $x_{i1}, x_{i2}, \ldots, x_{ik}$ is obtained after
repeated learning of the back-propagation type network $NN_s$. The number of learning
iterations is set at approximately 3000. Then the mean squared error between the
output data $y_i$ and the estimates $ey_i$ is calculated to obtain the evaluation value
$\Theta_k^s$ required for determining the input variables.

$$\Theta_k^s = \frac{1}{N}\sum_{i=1}^{N}(y_i - ey_i)^2, \quad s = 1, 2, \ldots, r. \qquad (8)$$
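
The evaluation value of Eq. (8) is a mean of squared residuals and can be computed as in this Python sketch. The function name and the sample values are illustrative, and the estimates are simply given rather than produced by a trained network.

```python
def evaluation_value(outputs, estimates):
    """Mean of the squared differences between outputs y_i and estimates ey_i."""
    if not outputs or len(outputs) != len(estimates):
        raise ValueError("need two non-empty sequences of equal length")
    return sum((y - e) ** 2 for y, e in zip(outputs, estimates)) / len(outputs)

y = [1.0, 2.0, 3.0]    # toy output data
ey = [1.0, 2.0, 5.0]   # toy estimates, with one residual of size 2
print(evaluation_value(y, ey))
```

A smaller value means that the network reproduces the output data more closely with the current set of input variables.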

In order to study the degree of correlation of the input variable $x_j$ with the
output variable y, the input variable $x_j$ is temporarily removed from the set of
input variables $\{x_1, x_2, \ldots, x_k\}$. The input data from which the input variable
$x_j$ is removed, $x_{i1}, \ldots, x_{i,j-1}, x_{i,j+1}, \ldots, x_{ik}$, $i = 1, 2, \ldots, N$, are
assigned to the input layer of the M-layer back-propagation type network
$[(k-1) \times u_2 \times \cdots \times u_{M-1} \times 1]$, and the output data $y_i$ are assigned to the
output layer. Then the estimate $ey_i'$ for the input data
$x_{i1}, \ldots, x_{i,j-1}, x_{i,j+1}, \ldots, x_{ik}$ is obtained after the back-propagation
learning. The evaluation value $\Theta_{k-1}^{sj}$ required for determining the input
variables is derived by calculating the mean squared error between the output data
$y_i$ and this estimate $ey_i'$.

$$\Theta_{k-1}^{sj} = \frac{1}{N}\sum_{i=1}^{N}(y_i - ey_i')^2, \quad s = 1, 2, \ldots, r. \qquad (9)$$

The same calculations are conducted
for the input variables other than xj to
determine the evaluations Θk-1s1, Θk-1s2,
..., Θk-1sj, ..., Θk-1sk. The evaluation that
takes the minimum value, Θk-1sc, is
obtained by

$$\Theta_{k-1}^{sc} = \min_j \Theta_{k-1}^{sj}, \quad j = 1, 2, \ldots, k. \qquad (10)$$

Eq. 10 shows that the evaluation Θk-1sc
obtained by removing the input variable xc
from the set of input variables takes the
minimum value among the evaluations
Θk-1s1, Θk-1s2, ..., Θk-1sj, ..., Θk-1sk. By
comparing the value of Θk-1sc of Eq. 10 with
the value of Θks of Eq. 8, the set of
variables As is altered as follows.

$$A_s = \{x_1, x_2, \ldots, x_{c-1}, x_{c+1}, \ldots, x_k\} \quad \text{if } \Theta_{k-1}^{sc} < \Theta_k^s \qquad (11)$$

$$A_s = \{x_1, x_2, \ldots, x_k\} \quad \text{if } \Theta_{k-1}^{sc} \ge \Theta_k^s \qquad (12)$$


When Eq. 11 holds, the mean squared
error is decreased by removing the input
variable xc, which means that the
estimation eyi' represents yi better than eyi.

Therefore, the correlation of the input
variable xc to the output variable y is
considered weak, and xc is removed from the
input variable set As. As a result, the newly
established set of input variables consists of
k-1 input variables.

On the other hand, when Eq. 12 holds,
nothing is gained by temporarily removing
an input variable. This means that the input
variable xc is strongly correlated with the
output variable y, and the number of input
variables in As is left unchanged at k.

In cases where the input variables can
be reduced, k is altered to n-1, n-2, ..., 1,
and Step 4 is repeated until Eq. 12 holds, at
which point the procedure for reducing the
input variables of the back-propagation type
network NNs is completed.

Thus, the back-propagation type
network NNs having the final set of input
variables As = {x1, x2, ..., xm}, obtained at
the time of procedure completion, becomes
the optimum back-propagation type network
representing the structure of the consequent
part of rule Rs. The same procedure is
conducted for each NNs to determine the
consequent parts of all the inference rules.
This procedure for reducing the number of
input variables is called the stepwise
variable reduction method utilizing a
back-propagation type network.
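
The loop described by Eqs. 8 to 12 can be sketched as a backward elimination. This is only an illustrative sketch: the function name `stepwise_reduction` and the callback `evaluate` are hypothetical, and `evaluate` stands in for "train a back-propagation type network on these input variables and return the mean squared error", which the text does not specify in code form.

```python
from typing import Callable, List, Sequence


def stepwise_reduction(variables: Sequence[int],
                       evaluate: Callable[[List[int]], float]) -> List[int]:
    """Backward elimination in the spirit of Eqs. 8-12.

    `evaluate(subset)` is a hypothetical callback that trains a network
    on the given input variables and returns the mean squared error.
    """
    current = list(variables)
    theta_k = evaluate(current)                       # Eq. 8
    while len(current) > 1:
        # Eq. 9: error after temporarily removing each variable in turn
        trials = [(evaluate([v for v in current if v != xj]), xj)
                  for xj in current]
        theta_c, xc = min(trials)                     # Eq. 10
        if theta_c >= theta_k:                        # Eq. 12: keep the set
            break
        current.remove(xc)                            # Eq. 11: drop xc
        theta_k = theta_c
    return current
```

Each pass trains one network per remaining variable, so the cost grows with the square of the number of inputs; this is acceptable for the four inputs used in the experiment below.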

Step 5: The estimation yi* is derived by the
equation shown below.

$$y_i^* = \frac{\sum_{s=1}^{r} \mu_{A_s}(x_{i1}, x_{i2}, \ldots, x_{in}) \cdot mey_i(s)}{\sum_{s=1}^{r} \mu_{A_s}(x_{i1}, x_{i2}, \ldots, x_{in})}, \quad i = 1, 2, \ldots, N, \qquad (13)$$

where meyi(s) is the estimation obtained by
the optimum back-propagation type network
derived in Step 4.
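
Eq. 13 is a membership-weighted average of the rule outputs. A minimal sketch follows; the function name `combine_rules` and the sample numbers are illustrative assumptions, not values from the paper.

```python
def combine_rules(memberships, estimates):
    """Eq. 13: weighted average of the consequent estimates.

    memberships: membership values of the antecedent parts, one per rule
    estimates:   consequent estimates mey_i(s), one per rule
    """
    numerator = sum(m * e for m, e in zip(memberships, estimates))
    denominator = sum(memberships)
    return numerator / denominator


# Two rules with memberships 0.25 and 0.75 and outputs 2.0 and 4.0
y_star = combine_rules([0.25, 0.75], [2.0, 4.0])
```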

Fig. 5 shows that the estimation yi*
is derived by multiplying the membership
value of the antecedent part of each
inference rule, μAs(xi1, xi2, ..., xin), by the
estimation of the corresponding consequent
part, meyi(s), and summing the products
over the rules. Note that Fig. 5 shows the
case where the condition μAs(xi1, xi2, ...,
xin) = 1 is established.

3. APPLICATION TO INVERTED PENDULUM 
SYSTEM 

The NDF proposed by the authors is
capable of forming inference rules
automatically, i.e., it has a self-autotuning
function. Proposed here is an inverted
pendulum system to which the learning
function of NDF is applied. In the algorithm
employed for the experiment, data for four
inputs and one output are acquired by
observing manual control operations, and
the fuzzy inference rules and membership
functions are then automatically constructed
from the acquired data by using the NDF
algorithm.

Fig. 6 shows the structure of the
inverted pendulum system, which consists of
the following four elements:

1) Cart which runs on a rail.

2) Pendulum freely rotatable around an axis
of the cart.

3) Motor which drives the cart.

4) Fixed pulleys and belt system which
connect the above three parts.

The pendulum angle from the
perpendicular, θ (degrees), and the distance
of the cart from its original position are
detected by potentiometers b and a shown in
Fig. 6, respectively. These are digitized by
an AD converter, and the digitized signals
are fed to a personal computer, where the
angular velocity of the pendulum and the
velocity of the cart are calculated from the
differences between the values obtained at
successive samplings. The output for the
motor control system is then calculated from
four variables, i.e., the pendulum angle,
angular velocity, cart distance, and cart
velocity, by using the NDF algorithm. Since
the motor control signal is derived by the
personal computer in digital form, it is
converted into an analog value through a DA
converter.

The inverted pendulum system has
two control areas: a linear control area
where the pendulum is standing, and a
non-linear control area where the pendulum
falls. The authors constructed the control in
the linear control area by using
conventional fuzzy control; the control
model constructed in the non-linear control
area by utilizing NDF is reported here.

The configuration of the inverted
pendulum system and the control computer
is as follows.

Body: Length of 1,410 mm, width of 400 mm,
height of 880 mm.

Pendulum: Length of 400 mm, weight of
40 g, diameter of 4 mm.

Drive force: 25 W DC motor with a gear ratio
of 12.5 : 1.

Sensors: Potentiometer to measure the
distance from the original position of the
cart, and another potentiometer to measure
the pendulum angle.

Micro-computer: CPU 80286.

Program: C language, 21K bytes.

The preparation of control rules for
the inverted pendulum, following the
algorithm developed for constructing
inference rules with NDF, is now described.

Step 1: Preparation of input-output data.
The data are acquired while an operator
tries to swing up the pendulum by moving
the cart right or left on the rail, pressing the
corresponding controller buttons until the
pendulum is brought to its inverted position.
The following input-output data are
recorded with a sampling period of 4 msec:

Output variable
y : Motor control signal (V).
Input variables
x1 : Distance from the original cart position.
x2 : Velocity of x1 (cm/sec).
x3 : Pendulum angle (deg).
x4 : Velocity of x3 (deg/sec),
where the input variables x2 and x4 are
derived from the differences between
successive values of x1 and x3.
Approximately 1,000 to 3,000 data are
acquired from these manual operations, and
from these, the 98 input-output data shown
in Table 1 are extracted and applied to the
NDF.
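
The derivation of x2 and x4 from successive samples can be sketched as a backward difference divided by the sampling period. The function name `velocities` is hypothetical, and the use of a simple first-order difference is an assumption; the text only states that the velocities come from the differences between samplings taken every 4 msec.

```python
def velocities(samples, period_s=0.004):
    """Backward differences of successive samples over the sampling period.

    samples:  measured positions (cm) or angles (deg), one per sampling
    period_s: sampling period in seconds (4 msec in the experiment)
    """
    return [(later - earlier) / period_s
            for earlier, later in zip(samples, samples[1:])]


# Cart positions in cm at three successive samplings
cart_velocity = velocities([0.0, 0.004, 0.012])   # values in cm/sec
```

The result has one element fewer than the input, so the first recorded sample has no velocity estimate.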

Step 2: Setting of two rules for the
input-output data, considering the data
distributions.

Step 3: Determination of the membership
functions of the antecedent parts. A
three-layered [4 × 6 × 2] back-propagation
type network is employed here for
determining the antecedent part
construction, and the number of learning
iterations is set at about 1000.
Step 4: Determination of the consequent
part structure. A three-layered [k × 6 × 1]
back-propagation type network, where
k = 4, 3, 2, 1, is employed here for
determining the consequent part structure,
and the number of learning iterations of
each back-propagation type network is set
at about 3000.

By using the stepwise variable
reduction method, we obtain

$$\Theta_4^1 = 0.016 \qquad (14)$$

$$\Theta_3^{11} = \min_j \Theta_3^{1j} \ (= 0.007), \quad j = 1, 2, 3, 4. \qquad (15)$$

Therefore,

$$\Theta_3^{11} < \Theta_4^1 \qquad (16)$$

Thus, by removing the input variable
x1, we obtain As = {x2, x3, x4}. The stepwise
variable reduction method is applied again
to As = {x2, x3, x4}. From Eqs. 8, 9, and 10,
we obtain the following.

$$\Theta_3^{11} = 0.007 \qquad (17)$$

$$\Theta_2^{13} = \min_j \Theta_2^{1j} \ (= 0.021), \quad j = 2, 3, 4. \qquad (18)$$

This means

$$\Theta_2^{13} > \Theta_3^{11} \qquad (19)$$

Thus, no further reduction of input
variables is made, and the algorithm for
Rule 1 is completed by the second
calculation process. The inference rules
consequently obtained are as follows:

R1 : IF x = (x1, x2, x3, x4) belongs to A1,
THEN y1 = NN1(x2, x3, x4),

R2 : IF x = (x1, x2, x3, x4) belongs to A2,
THEN y2 = NN2(x1, x2, x4). (20)

Photographs 1 and 2 show the
swing-up motions of the pendulum
controlled by the fuzzy inference rules
expressed by Eq. (20). Photograph 1 shows
sequential motions of the pendulum swung
from its stable equilibrium state to an
inverted stand-still state. The estimation
yi* is derived from Eq. (13). The pendulum
can be reliably brought to its inverted
position regardless of the cart position on
the rail or of a disturbance applied to the
pendulum. Photograph 2 shows the control
of the swing-up motion for various given
pendulum angles.

An experimental study of the limits
of the control performed by NDF is
conducted by changing a parameter that
governs the dynamic characteristics of the
controlled object; here, the length of the
pendulum is taken as that parameter. The
initial position of the cart is set at the
center of the belt on which the inverted
pendulum device is mounted. The pendulum
angle is defined as 0 degrees when the
pendulum initially hangs down and as ±180
degrees when the pendulum is in the
inverted position. The angle is incremented
for clockwise rotation and decremented for
anti-clockwise rotation.

The inference rules are constructed
for the case where the pendulum length is
40 cm, and Fig. 7 shows the response of that
pendulum. Figs. 8, 9, and 10 respectively
show the responses of the 20, 30, and 50 cm
long pendulums. In these figures, the
pendulum angle is shown by solid lines and
the angular velocity by broken lines. Only
the changes of pendulum angle and angular
velocity until the pendulum comes to the
inverted position are shown; the response
after completion of inversion is not shown.

As for the learning, the swing-up
process of the pendulum is learned in order
to construct inference rules applicable to the
motion of the pendulum from the
hanging-down position to a nearly inverted
position. The inverted position is defined as
a pendulum angle close to ±180 degrees with
an angular velocity of nearly zero at that
time.

As shown in Fig. 7, the pendulum
reaches -180 degrees 5.4 seconds after the
start of control with an angular velocity of
about 0 deg/sec, and the pendulum stands
still at the inverted position. This is a
rather natural consequence, since the
inference rules are established for a 40 cm
long pendulum.

In the case where the length of the
pendulum is set at 20 cm, as shown in Fig.
8, a large velocity change is observed, and
the angle becomes 180 degrees at 6.2 seconds
with an angular velocity of about 0 deg/sec.
Although the pendulum reaches the inverted
position and stays there, the angular
velocity is larger and a longer lead-in period
is required.

Fig. 9 shows the transient response of
a 30 cm long pendulum. The pendulum is
brought to its inverted position with a
response similar to that obtained with the 40
cm long pendulum, but the angle reaches
-180 degrees at 3.9 sec with a higher
angular velocity, equal to about one half of
that obtained with the 20 cm long
pendulum. The overall control
characteristics are similar to those of the 40
cm long pendulum.

Fig. 10 shows the transient response
obtained with a 50 cm long pendulum,
which could not be brought to its inverted
position. As seen in Fig. 10, the pendulum
angle could not be brought to ±180 degrees
despite a longer lead-in period. The relation
between the dynamic characteristics of the
pendulum and the pendulum length can be
summarized as follows.

1) When NDF is applied to a pendulum
system whose length is varied from 40 to 20
cm, stable operation that brings the
pendulum to its inverted position is feasible,
although a longer lead-in period is required.
That is to say, the robustness of NDF is
higher for shorter pendulums.

2) For longer pendulums, however, the
deviations of the control system cannot be
suppressed, which means that relearning or
additional learning is necessary when NDF
is applied to a longer pendulum.

4. CONCLUSION 


While conventional fuzzy reasoning
is associated with inherent tuning problems,
NDF is capable, once input-output data are
given, of determining optimum inference
rules and membership functions by utilizing
the nonlinearity and learning capability of
the back-propagation type network. In order
to verify the usefulness of NDF, it is applied
to an experimentally constructed pendulum
system in which the pendulum is brought
from its stable hanging position to its
inverted position and kept there. The length
of the pendulum is also altered to confirm
its effect on the control characteristics of
NDF.

Since this method is capable of
deriving inference rules by using the
learning function of a back-propagation type
network, a learning function can be
introduced into fuzzy control. The
development of a learning function that
adapts to changes in a dynamic inference
environment should be an important subject
for future discussion.
Fig. 1 Conventional Fuzzy Partition of Rules

Fig. 2 Example of Neural Network

Fig. 3 Decision of Membership Function in Antecedent Parts of Rules

Fig. 4 Proposed Fuzzy Partition of Rules

Fig. 9 Angle and Angular Velocity of 30 cm Long Pendulum

Fig. 10 Angle and Angular Velocity of 50 cm Long Pendulum
ABSTRACT

A max-min fuzzy relational system can be regarded as a network of max
and min operational elements. Thus the inverse problem of a fuzzy relational
equation is interpreted as the problem of estimating the input from the output
values in the corresponding network. An approximate network model of the fuzzy
relational system is proposed. An algorithm for obtaining an approximate solution
of the system is presented by using a neural network technique. Its applicability is
discussed with a numerical experiment.

Key words : fuzzy relation, fuzzy inverse problem, neural network, perceptron
model


Introduction 

The inverse problem of the fuzzy relational equation (fuzzy inverse problem) was proposed by
E. Sanchez in 1976 [1]. A solution of the fuzzy inverse problem was shown by Tukamoto et al. in
1977 [2]. It is now used for the diagnosis of complicated systems.

A max-min fuzzy relational system can be regarded as a network which consists of max and
min operational elements. Thus the fuzzy inverse problem is interpreted as the problem of
estimating the input values from the output values in the corresponding network.

From this viewpoint, the input of the network can be identified when the output and the network
structure are given. Assuming that the network of the fuzzy relational system can be approximately
constructed by a perceptron, we can regard the input of the fuzzy system as the input of the
perceptron when the output and the perceptron structure are given.

In this paper, an input value estimation algorithm based on the perceptron model is proposed
and applied to solving the fuzzy inverse problem. A numerical experiment is carried out to
investigate the applicability of this method, and the results of the experiment are discussed.

Input Estimation Algorithm of Perceptron Model 

The perceptron neural network model [3] is used in this paper. It is summarized as follows:

a. The output value of the j-th neuron in the k-th layer is denoted by $u_j^k$.

b. The threshold value of the j-th neuron in the k-th layer is denoted by $\theta_j^k$.

c. The connection coefficient from the i-th neuron in the (k-1)-th layer to the j-th one in the
k-th layer is denoted by $w_{ij}^{k}$.

d. The relations among the above three values are described by (1)-(3).

$$u_j^k = f(s_j^k) \qquad (1)$$

$$s_j^k = \sum_i w_{ij}^{k} u_i^{k-1} - \theta_j^k \qquad (2)$$

$$f(s) = \frac{1}{1 + \exp(-s)} \qquad (3)$$

The input estimation algorithm of the perceptron model is described as follows.

Preparation

[1] Assume the perceptron model has n input items and m output items.

[2] Define the evaluation function as

$$E(s) = \frac{1}{2}\sum_{j=1}^{m} \{y_j - u_j(s)\}^2 \qquad (4)$$

where $s = (s_1, s_2, s_3, \ldots, s_n)$ is the input and $y_j$ is the j-th given output value
of the perceptron.

Algorithm

[1] Set the initial input s to an arbitrary value.

[2] Calculate the output $u = (u_1, u_2, u_3, \ldots, u_m)$ for the current input value s.

[3] Change the value s according to

$$\frac{ds}{dt} = -\varepsilon \frac{\partial E}{\partial s} \qquad (5)$$

where ε is a positive value and the derivative ∂E/∂s is expanded by the chain rule through
the layers of the perceptron.

[4] Repeat [2]-[3] until the value E becomes sufficiently small.

[5] The final value s is the value estimated by the perceptron.
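
The algorithm above amounts to gradient descent on the input of a fixed, already trained network. The following is a minimal sketch under several assumptions: the layer representation `(W, theta)`, the function names, the Euler discretization of eq. (5), and the default step size and iteration count are all illustrative choices, and the threshold is assumed to be subtracted from the weighted sum.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(layers, s):
    """Return the outputs of every layer for input s.

    layers is a list of (W, theta); W[j][i] connects unit i of the
    previous layer to unit j, and theta[j] is the threshold of unit j.
    """
    outs = [list(s)]
    for W, theta in layers:
        prev = outs[-1]
        outs.append([sigmoid(sum(w * u for w, u in zip(row, prev)) - t)
                     for row, t in zip(W, theta)])
    return outs


def error(layers, s, y):
    """Evaluation function (4) for input s and given output y."""
    u = forward(layers, s)[-1]
    return 0.5 * sum((t - v) ** 2 for t, v in zip(y, u))


def input_gradient(layers, s, y):
    """dE/ds obtained by the chain rule, from the output layer back to the input."""
    outs = forward(layers, s)
    grad = [u - t for u, t in zip(outs[-1], y)]            # dE/du at the output
    for (W, _), u in zip(reversed(layers), reversed(outs[1:])):
        delta = [g * v * (1.0 - v) for g, v in zip(grad, u)]   # through the sigmoid
        grad = [sum(W[j][i] * delta[j] for j in range(len(W)))
                for i in range(len(W[0]))]                 # back to the previous layer
    return grad


def estimate_input(layers, y, s0, eps=0.5, steps=2000):
    """Euler discretization of eq. (5): s <- s - eps * dE/ds."""
    s = list(s0)
    for _ in range(steps):
        g = input_gradient(layers, s, y)
        s = [si - eps * gi for si, gi in zip(s, g)]
    return s
```

The weights are never changed here; only the input moves. Because several inputs can produce nearly the same output, the estimate generally depends on the initial value chosen in step [1].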


Fuzzy Inverse Problem 

A fuzzy relation R between a set X and a set Y is regarded as a fuzzy set on the direct product
of X and Y, and its membership function is denoted by

$$\mu_R(x, y), \quad (x, y) \in X \times Y = \{(x, y) \mid x \in X,\ y \in Y\}. \qquad (6)$$

Assume that A is a fuzzy set on X and B is another fuzzy set on Y, with membership functions
$\mu_A$ and $\mu_B$ respectively. Then the fuzzy relation R satisfies (7), which means (8).

$$B = R \circ A \qquad (7)$$

$$\mu_B(y) = \max_{x \in X} \{\mu_R(x, y) \wedge \mu_A(x)\} \qquad (8)$$

If A and B are the fuzzy input and output, respectively, then (7) is interpreted as the equation
of a system which has fuzzy input and output.

The fuzzy inverse problem is then the inverse problem of the fuzzy relational equation, i.e.
identifying A for the given B and R in equation (7).
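
For finite sets X and Y, equation (8) reduces to a max-min product of a matrix and a vector. A small sketch follows; the function name `max_min_compose`, the storage of R with one row per element of Y, and the sample values are assumptions made for illustration.

```python
def max_min_compose(R, a):
    """Equation (8) on finite sets: b[j] = max over i of min(R[j][i], a[i]).

    R[j][i] holds the membership of the pair (x_i, y_j); a[i] is the
    membership of x_i in A.
    """
    return [max(min(r, ai) for r, ai in zip(row, a)) for row in R]


# Hypothetical relation with three input elements and two output elements
R_example = [[0.6, 0.5, 0.8],
             [0.4, 0.1, 0.9]]
b = max_min_compose(R_example, [1.0, 0.3, 0.4])   # gives [0.6, 0.4]
```

Since max and min discard information, different fuzzy sets A can yield the same B, which is why the inverse problem generally has no unique solution.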

Method of Solving Fuzzy Inverse Problem by using Perceptron Model 

This method is divided into two phases, i.e. the learning phase and the solving phase.
The learning phase is summarized as follows:

[1] Let $A_i$ and $B_i$ be the i-th input and output of the fuzzy relational system R
( i = 1, 2, ..., M ), respectively.

[2] Encode $A_i$ and $B_i$ to $x_i$ and $y_i$, respectively, according to

$$[0,1] \ni a_i \mapsto x_i = 2.2 \times a_i - 1.1 \in [-1.1, 1.1] \qquad (9)$$

$$[0,1] \ni b_i \mapsto y_i = 0.8 \times b_i + 0.1 \in [0.1, 0.9] \qquad (10)$$

[3] Let the multi-layer perceptron learn by using the input-output pairs $x_i$ and $y_i$
obtained in [2].

After finishing this learning phase, we can get the solution as follows:

[4] Let B be the fuzzy set given in the fuzzy inverse problem.

[5] Encode B to y by using eq. (10).

[6] Apply the input estimation algorithm to the learned perceptron and estimate the
input x for the given output y.

[7] Decode the estimated input x to A by using eq. (9); this gives the solution of the problem.
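
The encodings (9) and (10) and the decoding used in step [7] can be sketched as below. The function names are hypothetical, and clipping the decoded value to [0, 1] is an added assumption, since the text does not say how estimates outside [-1.1, 1.1] are handled.

```python
def encode_input(a):
    """Eq. (9): map membership values in [0, 1] to [-1.1, 1.1]."""
    return [2.2 * ai - 1.1 for ai in a]


def encode_output(b):
    """Eq. (10): map membership values in [0, 1] to [0.1, 0.9]."""
    return [0.8 * bi + 0.1 for bi in b]


def decode_input(x):
    """Inverse of eq. (9); clipping to [0, 1] is an assumption."""
    return [min(1.0, max(0.0, (xi + 1.1) / 2.2)) for xi in x]
```

Keeping the targets inside [0.1, 0.9] avoids asking the sigmoid output of eq. (3) to reach exactly 0 or 1, values it only approaches asymptotically.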


Numerical Experiment 

To investigate the applicability of the method discussed in the previous section, a numerical
experiment on a digital computer has been done as follows:

[1] The fuzzy relation R used in the fuzzy inverse problem is given as (11) in this
experiment.

$$R = \begin{bmatrix}
\mu_R(x_1,y_1) & \mu_R(x_2,y_1) & \cdots & \mu_R(x_5,y_1) \\
\mu_R(x_1,y_2) & \mu_R(x_2,y_2) & \cdots & \mu_R(x_5,y_2) \\
\mu_R(x_1,y_3) & \mu_R(x_2,y_3) & \cdots & \mu_R(x_5,y_3) \\
\mu_R(x_1,y_4) & \mu_R(x_2,y_4) & \cdots & \mu_R(x_5,y_4) \\
\mu_R(x_1,y_5) & \mu_R(x_2,y_5) & \cdots & \mu_R(x_5,y_5)
\end{bmatrix}
= \begin{bmatrix}
0.6 & 0.5 & 0.8 & 0.3 & 0.2 \\
0.4 & 0.1 & 0.9 & 0.6 & 0.4 \\
0.1 & 0.1 & 0.9 & 0.8 & 0.5 \\
0.9 & 0.2 & 0.9 & 0.1 & 0.5 \\
0.4 & 0.5 & 0.3 & 0.8 & 0.9
\end{bmatrix} \qquad (11)$$

[2] The learning data for the perceptron are generated as follows:

i) Make fuzzy sets $A_1$ - $A_M$ whose membership values at each element take all
combinations of the values {0, 0.2, 0.4, 0.6, 0.8, 1.0} (c.f. $6^5 = 7776$).

ii) Apply each $A_i$ to the fuzzy relation R by using max-min composition to get the
fuzzy set $B_i$.

iii) Encode the fuzzy sets $A_i$ and $B_i$ by eqs. (9) and (10) to obtain the learning
data $x_i$ and $y_i$.

[3] The learning phase is then carried out by using $x_i$ and $y_i$ as learning data. The
structure of the perceptron used in this experiment is shown in table 1.

table 1 : Outline of the Perceptron

layer number         1     2     3     4
number of neurons    5    10    10     5

The error back propagation algorithm [4] is used for learning, where the number of
learning processes is defined as follows: one learning process is a learning operation
for one input-output pair. The perceptron used in this experiment learns about 500
thousand times. The distribution of the value of evaluation function (4) for the
learning data is shown in figure 1.
[4] The test data for the fuzzy inverse problem are made as follows:

i) Make fuzzy sets $A_i$ whose membership values at each element are random values
from [0,1] (where i = 1 - 1000 in this experiment). Get $B_i$ by composing $A_i$ with
the fuzzy relation R.

[5] Encode $B_i$ by eq. (10) to get $y_i$. Applying the input estimation algorithm to the
perceptron, we get the estimated input value $x^*_i$. Then we get the solution of the fuzzy
inverse problem by decoding $x^*_i$ to $A^*_i$ by eq. (9).

[6] Compose the obtained solution $A^*_i$ with the fuzzy relation R to get $B^*_i$. The
correctness of the solution is evaluated by the closeness of the membership values of each
element between $B^*_i$ and $B_i$.

[7] The distributions of the values of the evaluation functions (12)-(14) for all test data
are shown in figure 2 - figure 4.


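$$E_{mean} = \mathrm{Mean}\ |b^*_j - b_j| \qquad (12)$$

$$E_{max} = \mathrm{Max}\ |b^*_j - b_j| \qquad (13)$$

$$E_{min} = \mathrm{Min}\ |b^*_j - b_j| \qquad (14)$$

for all j, $b_j \in B_i$, $b^*_j \in B^*_i$.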
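
The data generation of step [2] and the evaluation functions (12)-(14) can be sketched as follows. The names `learning_pairs`, `compose`, and `error_measures` are hypothetical, and the layout of R (one row per output element y_j) is an assumption about how matrix (11) is indexed.

```python
from itertools import product

# Fuzzy relation of eq. (11); rows are assumed to correspond to y_1..y_5
R = [[0.6, 0.5, 0.8, 0.3, 0.2],
     [0.4, 0.1, 0.9, 0.6, 0.4],
     [0.1, 0.1, 0.9, 0.8, 0.5],
     [0.9, 0.2, 0.9, 0.1, 0.5],
     [0.4, 0.5, 0.3, 0.8, 0.9]]
LEVELS = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]


def compose(R, a):
    """Max-min composition B = R o A on finite sets."""
    return [max(min(r, ai) for r, ai in zip(row, a)) for row in R]


def learning_pairs():
    """All 6**5 = 7776 fuzzy sets A on the grid, each with its B."""
    for a in product(LEVELS, repeat=5):
        yield list(a), compose(R, a)


def error_measures(b_star, b):
    """Evaluation functions (12)-(14): mean, max and min of |b*_j - b_j|."""
    diffs = [abs(p - q) for p, q in zip(b_star, b)]
    return sum(diffs) / len(diffs), max(diffs), min(diffs)
```

Encoding the pairs with eqs. (9) and (10) and training the perceptron are separate steps and are not shown here.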



Discussion 

The applicability of the perceptron to the fuzzy inverse problem was shown through a
numerical experiment in the previous section. An approximate solution of the fuzzy inverse
problem is obtained, but its precision is not sufficient. This is mainly because the error in
approximating the fuzzy relation by the perceptron is not small enough.

The distribution of the approximation error of the fuzzy relation was shown in figure 1 in the
previous section, but the inputs used for measuring that error are exactly the learning inputs.
It is also necessary to measure the error for inputs that were not learned, so the distribution of
evaluation function (4) for such inputs is shown in figure 5. Comparing the errors for learned
and unlearned inputs, it should be noted that the distributions have the same shape. This is due
to the generalization capability of neural networks. Hence the difference between learned and
unlearned inputs can be considered to have no effect on the precision of the solution.

The distributions in figure 1 and figure 5 are bell shaped, not exponential. This indicates
that the approximation error does not fall below a certain threshold value, because conflicts
occur between one learning input-output pair and others. Therefore, even if we increase the
number of learning processes or use more input-output pairs for learning, an improvement in the
precision of approximating the fuzzy relation is not expected.

Now let us discuss how small the approximation error of the fuzzy relation is. The measurement
was done as follows: the membership value of fuzzy set A at one element is varied over the
interval [0,1], and the membership values of the fuzzy set B obtained by composing A with the
fuzzy relation R in eq. (11) are compared with those obtained from the output of the perceptron.
The membership values used for the comparison are shown in table 2, and the results are shown in
figure 6 - figure 10. In these figures, the horizontal axis represents the membership value of the
varied element of fuzzy set A, and the vertical axes represent the membership values of each
element of fuzzy set B. There are two lines in each graph: one (linear in shape) represents the
characteristics of the fuzzy relation being approximated, and the other (not linear in shape)
represents the characteristics of the perceptron which approximates the fuzzy relation.
table 2 : Conditions of membership value for fuzzy set A

No.    element number                            figure number
       1       2       3       4       5
1      [0,1]   0.3     0.5     0.2     0.7       fig. 6
2      0.3     [0,1]   0.2     0.6     0.4       fig. 7
3      0.2     0.6     [0,1]   0.4     0.7       fig. 8
4      0.8     0.7     0.2     [0,1]   0.5       fig. 9
5      0.2     0.9     0.3     0.4     [0,1]     fig. 10


Conclusion 

An input estimation algorithm for the perceptron model has been proposed and applied to the
fuzzy inverse problem. Numerical experiments have been done in order to get the solution of the
fuzzy inverse problem. The precision of the approximate solution obtained by this method is
discussed, and the approximation error distribution is investigated. From the results of these
numerical experiments, it is concluded that this method is applicable for obtaining an approximate
solution of the fuzzy inverse problem.
Figure 2. Distribution of evaluation function E mean for mean evaluation

Figure 3. Distribution of evaluation function E max for maximum evaluation

Figure 4. Distribution of evaluation function E min for minimum evaluation

Figure 6. Comparison of fuzzy set B and output of perceptron (No.1)

Figure 7. Comparison of fuzzy set B and output of perceptron (No.2)

Figure 8. Comparison of fuzzy set B and output of perceptron (No.3)
(panels: membership values of b5, b4, b3, b2, b1 versus membership value of a)

Figure 9. Comparison of fuzzy set B and output of perceptron (No.4)

Figure 10. Comparison of fuzzy set B and output of perceptron (No.5)
ABSTRACT

The use of fuzzy logic to model and manage uncertainty in a rule-based system places
high computational demands on an inference engine. In an earlier paper, we introduced a
trainable neural network structure for fuzzy logic. These networks can learn and extrapolate
complex relationships between possibility distributions for the antecedents and consequents
in the rules. In this paper, the power of these networks is further explored. The
insensitivity of the output to noisy input distributions (which are likely if the clauses are
generated from real data) is demonstrated, as well as the ability of the networks to
internalize multiple conjunctive clause and disjunctive clause rules. Since different rules
(with the same variables) can be encoded in a single network, this approach to fuzzy logic
inference provides a natural mechanism for rule conflict resolution.


1. INTRODUCTION. 

In dealing with automated decision making problems, and computer vision in
particular, there is a growing need for modeling and managing uncertainty. Computer vision
is beset with uncertainty of all types. A partial list of the causes of such uncertainty includes:

complexity of the problems,
questions which are ill-posed,
vagueness of class definitions,
imprecisions in computations,
noise of various sorts,
ambiguity of representations, and
problems in scene interpretation.

Rule-based approaches for handling these problems have gained popularity in recent years
[1-6]. They offer a degree of flexibility not found in traditional approaches. The systems
based on classical (crisp) logic need to incorporate, as an add-on, the processing of the
uncertainty in the information. Methods to accomplish this include heuristic approaches [7,
8], probability theory [9,10], Dempster-Shafer belief theory [4,5,11], and fuzzy set theory
[5,6,12-14].

Fuzzy logic, on the other hand, is a natural mechanism for propagating uncertainty 
explicitly in a rule base. All propositions are modeled by possibility distributions over 
appropriate domains. For example, a computer vision system may have rules like 

IF the range is LONG, THEN 

the prescreener window size is SMALL; 
or 

IF the color is MOSTLY RED, THEN 

the steak is MEDIUM RARE is TRUE. 

Here, LONG, SMALL, MOSTLY RED and TRUE are modeled by fuzzy subsets over 
appropriate domains of discourse. The possibility distributions can be generated from 
various histograms of feature data extracted from images, fuzzification of values produced 
by pattern recognition algorithms, experts expressing (free form) opinions on some 
questions, or possibly generated by a neural network learning algorithm. 

The generality inherent in fuzzy logic comes at a price. Since all operations involve
sets, rather than numbers, the amount of calculation per inference rises dramatically. Also,
in a fuzzy logic system, generally more rules can be fired at any given instant. One
approach to combat this computational load has been the development of special purpose
chips which perform particular versions of fuzzy inference [15]. Artificial Neural Networks
offer the potential of parallel computation with high flexibility. In an earlier paper [16], we
introduced a backpropagation neural network structure to implement fuzzy logic inference.
In this paper we demonstrate further properties of that network. In particular, we show the
insensitivity of the networks to noisy input distributions and their ability to internalize
rules with multiple conjunctive and disjunctive antecedent clauses.

2. FUZZY LOGIC AND NEURAL NETWORKS. 

The original fuzzy inference mechanism extended the traditional modus ponens rule,
which states that from the propositions
P1: If X is A Then Y is B
and P2: X is A,
we can deduce Y is B. If proposition P2 did not exactly match the antecedent of P1, for
example, X is A', then the modus ponens rule would not apply. However, in [17], Zadeh
extended this rule for the case where A, B, and A' are modeled by fuzzy sets, as suggested
above. In this case, P1 is characterized by a possibility distribution:

$$\Pi_{(X,Y)} = R \quad \text{where} \quad \mu_R(u,v) = \max\{(1 - \mu_A(u)),\ \mu_B(v)\}.$$

It should be noted that this formula corresponds to the statement "not A or B", the
logical translation of P1. An alternate translation of the rule P1 which corresponds more
closely to multivalue logic is

$$\mu_R(u,v) = \min\{1,\ (1 - \mu_A(u)) + \mu_B(v)\}, \quad [17],$$

called the bounded sum.

In either case, Zadeh now makes the inference Y is B' from $\mu_R$ and $\mu_{A'}$ by

$$\mu_{B'}(v) = \max_u \{\min\{\mu_R(u,v),\ \mu_{A'}(u)\}\}.$$

This is called the compositional rule of inference.

While this formulation of fuzzy inference directly extends modus ponens, it suffers 
from some problems [18,19]. In fact, if proposition P 2 is X is A, the resultant fuzzy set is 
not exactly the fuzzy set B. Several authors [18-20] have performed theoretical 
investigations into alternative formulations of fuzzy implications in an attempt to produce 
more intuitive results. 
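
The two translations and the compositional rule of inference can be tried on sampled possibility distributions. The sketch below is illustrative only: the function names and the three-point distributions A and B are assumptions, not the trapezoidal distributions of Table I.

```python
def implication_max(a, b):
    """R[u][v] = max(1 - a[u], b[v]): the "not A or B" translation."""
    return [[max(1.0 - au, bv) for bv in b] for au in a]


def implication_bounded_sum(a, b):
    """R[u][v] = min(1, (1 - a[u]) + b[v]): the bounded sum translation."""
    return [[min(1.0, 1.0 - au + bv) for bv in b] for au in a]


def infer(relation, a_prime):
    """Compositional rule: b'[v] = max over u of min(R[u][v], a'[u])."""
    n_v = len(relation[0])
    return [max(min(relation[u][v], a_prime[u]) for u in range(len(a_prime)))
            for v in range(n_v)]


# Hypothetical sampled distributions for A and B
A = [0.0, 0.5, 1.0]
B = [1.0, 0.5, 0.0]
B_prime = infer(implication_max(A, B), A)   # gives [1.0, 0.5, 0.5], not B
```

With the input equal to the antecedent A, the inferred distribution differs from B in its last sample, which illustrates the problem noted above.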

In using fuzzy logic in real rule-based systems, the possibility distributions for the 
various clauses in the rule base are normally sampled at a fixed number of values over their 
respective domains of discourse, creating a vector representation for the possibility 
distribution. Table I shows the sampled versions of the "trapezoidal" possibility distributions, 
used in the simulation study, sampled at integer values over the domain [1,11]. Clearly, the
sampling frequency has a direct effect on the faithfulness of the representation of the
linguistic terms under consideration and also on the amount of calculation necessary to
perform inference using a composition rule. For a single antecedent clause rule, the
translation becomes a two-dimensional matrix and the inference is equivalent to a
matrix-vector multiplication (with max and min in place of sum and product). As the number
of antecedent clauses increases, the storage (multidimensional matrices) and the
computation in the inference process grow exponentially.
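
To make the growth concrete, the short sketch below counts the entries of the relation for a rule with n antecedent clauses. It assumes, purely for illustration, that every antecedent and the consequent are sampled at the same number of points (11, as in Table I) and that the full relation is stored; `relation_entries` is a hypothetical helper name.

```python
def relation_entries(samples_per_clause, n_antecedents):
    """Number of stored values in the relation for n antecedents and one consequent."""
    return samples_per_clause ** (n_antecedents + 1)


for n in (1, 2, 3, 4):
    # One axis per antecedent clause plus one axis for the consequent.
    print(n, relation_entries(11, n))
```

Under these assumptions a single-clause rule needs 121 values, while a four-clause rule already needs 161051.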

Neural network structures offer a means of performing these computations in parallel
with a compact representation. But the ability of such a network to generalize from an
existing training set is its most valuable feature. In [16], we introduced the neural network
architecture for fuzzy logic. Figure 1 displays a three-layer feed-forward neural network
which is used in fuzzy logic inference for conjunctive clause rules. It consists of an input
layer to receive the possibility distributions of the antecedent clauses, one hidden layer to
internalize a representation of the relationships, and an output layer to produce the
possibility distribution of the consequent.

The input layer is not fully connected to the hidden layer. Instead, each antecedent 
clause has its own set of hidden neurons to learn the desired relationship. This partitioning 
of the hidden layer was done to ease the training burden for multiple clause rules, and to 
treat each input clause with its hidden units as a functional block. The training was 
performed using the standard back propagation technique [21]. 
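
The partitioned structure can be sketched as follows. This is an illustrative forward pass only, not the trained network of the paper: the sigmoid activation, the random initial weights, and the class and function names are assumptions, while the sizes (11 samples per clause, eight hidden neurons per clause) follow Tables I to III. Training by backpropagation is omitted.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_layer(n_in, n_out, rng):
    """Random weights and biases for one dense layer (illustrative initialisation)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [rng.uniform(-0.5, 0.5) for _ in range(n_out)]
    return weights, biases


def dense(layer, inputs):
    """Apply one dense layer followed by a sigmoid."""
    weights, biases = layer
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


class PartitionedRuleNet:
    """Each antecedent clause feeds only its own block of hidden neurons."""

    def __init__(self, n_clauses, n_samples=11, hidden_per_clause=8, seed=0):
        rng = random.Random(seed)
        self.blocks = [
            make_layer(n_samples, hidden_per_clause, rng) for _ in range(n_clauses)
        ]
        # The output layer sees the hidden units of all blocks.
        self.output = make_layer(n_clauses * hidden_per_clause, n_samples, rng)

    def forward(self, clauses):
        hidden = []
        for block, clause in zip(self.blocks, clauses):
            hidden.extend(dense(block, clause))
        return dense(self.output, hidden)


net = PartitionedRuleNet(n_clauses=2)
low = [1.00, 0.67, 0.33] + [0.0] * 8
medium = [0.0, 0.0, 0.25, 0.5, 0.75, 1.0, 0.75, 0.5, 0.25, 0.0, 0.0]
print(len(net.forward([low, medium])))  # one output value per sample point
```

Because the weights are untrained, the output values themselves carry no meaning; the sketch only shows how the input-to-hidden connections are partitioned by clause.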

3. EXPERIMENTS. 

The neural network architecture performed very well in generalizing the complex 
relationships between inputs and outputs. Table II (from [16]) shows the results of the 
training and testing of a network to implement the rule: IF X is LOW Then Y is HIGH; 
whereas Table III gives the situation for a rule with two conjunctive antecedent clauses. In 
both cases, the performance of the networks matched our intuitive expectation. 

Figure 2 shows typical responses of a neural network to noise in the input clause. It
can be seen that the errors in the result are of the same order as the error in the input. If
the networks are trained with fewer relationships, e.g., the traditional modus ponens
expectations, this error drops significantly.
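
The error figures quoted for Figure 2 can be reproduced with a few lines of arithmetic. The sketch below assumes that the total-sum-of-squared (TSS) error is the sum of squared differences over the 11 sample points, and it reads the entries printed without a decimal point in Figure 2(a) as .25, .29 and .20; both are assumptions, although the results agree with the reported values.

```python
def tss(expected, actual):
    """Total sum of squared differences between two sampled distributions."""
    return sum((e - a) ** 2 for e, a in zip(expected, actual))


# Values as listed in Figure 2(a).
MEDIUM = [0.00, 0.00, 0.25, 0.50, 0.75, 1.0, 0.75, 0.50, 0.25, 0.00, 0.00]
NOISY_INPUT = [0.06, 0.02, 0.35, 0.50, 0.79, 1.0, 0.72, 0.54, 0.29, 0.01, 0.00]
HIGH = [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.20, 0.40, 0.60, 0.80, 1.0]
OUTPUT = [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.28, 0.48, 0.67, 0.84, 1.0]

print(round(tss(MEDIUM, NOISY_INPUT), 4))  # input error; Figure 2(a) reports 0.020
print(round(tss(HIGH, OUTPUT), 4))         # output error; Figure 2(a) reports 0.019
```

The input and output errors come out at roughly 0.0198 and 0.0193, which is the sense in which the error in the result is of the same order as the error in the input.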

In order to implement rules with disjunctive antecedent clauses, networks with two 
hidden layers were necessary. Table IV displays training relationships for a two clause 
disjunctive rule. Note that there are 23 input/output triples necessary to enable the network 
to respond appropriately. The training, using backpropagation, of a single hidden layer 
network, of the type shown in figure 1, failed to converge on this complex training set. This 
caused us to investigate a two hidden layer structure where the first hidden layer was the 
same as in figure 1 and the second hidden layer contained 6 neurons totally connected to 
those of the first hidden layer and to the nodes of the output layer. This network converged 
in 4073 passes through the training set with a total-sum-of-squared error of less than 0.001 
for the entire training ensemble. We feel that this is a remarkable achievement, given the 
diversity of the responses to the antecedent possibility distributions which were necessary. 

This disjunctive structure was further tested with 18 input pairs of clauses including 
twelve pairs with varying amounts of additive Gaussian noise. For this test set the average
total-sum-of-squared-error per trial was 0.075. In other words, the match to the expected 
output in all cases was very good. 

As a final note, in [16] we demonstrated that a neural network structure of this type 
could encode multiple different rules which shared common antecedent clause variables. 
The packing of several rules into a single network has a surprising side benefit of providing 
a natural means of conflict resolution in fuzzy logic. 

4. CONCLUSION.

Fuzzy logic is a powerful tool for managing uncertainty in rule-based systems. Neural
network architectures offer a means of relieving some of the computational burden inherent
in fuzzy logic. These structures can also be trained to learn and extrapolate complex
relationships between antecedents and consequents, are relatively insensitive to noise
in the inputs, and provide a natural mechanism for conflict resolution.

Figure 1. A three-layer feed-forward neural network for fuzzy logic inference.

Rule: IF X is MEDIUM THEN Y is HIGH

MEDIUM   .00  .00  .25  .50  .75  1.0  .75  .50  .25  .00  .00
INPUT    .06  .02  .35  .50  .79  1.0  .72  .54  .29  .01  .00

TSS error = 0.020

[Plot: MEDIUM and INPUT]

HIGH     .00  .00  .00  .00  .00  .00  .20  .40  .60  .80  1.0
OUTPUT   .00  .00  .00  .00  .00  .00  .28  .48  .67  .84  1.0

TSS error = 0.019

[Plot: HIGH and OUTPUT]

Figure 2(a). Response of rule network to an input with a small amount of additive Gaussian
noise.

MEDIUM   .00  .00  .25  .50  .75  1.0  .75  .50  .25  .00  .00
INPUT    .00  .08  .24  .52  .77  1.0  .64  .41  .43  .00  .00

TSS error = 0.060

[Plot: MEDIUM and INPUT]

HIGH     .00  .00  .00  .00  .00  .00  .20  .40  .60  .80  1.0
OUTPUT   .00  .00  .00  .00  .00  .00  .25  .46  .65  .83  1.0

TSS error = 0.010

[Plot: HIGH and OUTPUT]

Figure 2(b). Response of rule network to an input with a larger amount of additive
Gaussian noise.

Table I. The meaning of linguistic terms defined on the domain [1,11] and sampled at
integer points.

Label            Membership (x = 1, 2, ..., 11)
                    1     2     3     4     5     6     7     8     9    10    11
LOW              1.00  0.67  0.33  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
VERY LOW         1.00  0.45  0.11  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
MORL LOW         1.00  0.82  0.57  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
NOT LOW          0.00  0.33  0.67  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
NOISY LOW (1)    1.00  0.70  0.40  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
NOISY LOW (2)    1.00  0.70  0.30  0.00  0.00  0.00  0.00    ?     ?   0.00  0.00
NOISY MEDIUM     0.00  0.00  0.30  0.53  0.81    ?     ?     ?     ?   0.00  0.00
SHIFTED LOW      1.00  1.00  1.00  0.67  0.33  0.00  0.00  0.00  0.00  0.00  0.00
MEDIUM           0.00  0.00  0.25  0.50  0.75  1.00  0.75  0.50  0.25  0.00  0.00
MORL MEDIUM      0.00  0.00  0.50  0.71  0.87  1.00  0.87  0.71  0.50  0.00  0.00
NOT MEDIUM       1.00  1.00  0.75  0.50  0.25  0.00  0.25  0.50  0.75  1.00  1.00
HIGH             0.00  0.00  0.00  0.00  0.00  0.00  0.20  0.40  0.60  0.80  1.00
VERY HIGH        0.00  0.00  0.00  0.00  0.00  0.00    ?     ?   0.36  0.64  1.00
MORL HIGH        0.00  0.00  0.00  0.00  0.00  0.00    ?     ?     ?   0.89  1.00
UNKNOWN          1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00

MORL = more or less. ? = value not legible in the source.

Note: Very$^n$ A is determined by $\mu_{\mathrm{Very}^n A}(x) = [\mu_A(x)]^{n+1}$;
MORL$^n$ A is determined by $\mu_{\mathrm{MORL}^n A}(x) = [\mu_A(x)]^{1/(n+1)}$.
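
The hedged terms of Table I can be regenerated from the base terms. The sketch below is illustrative only. It takes MORL^n A to be the (n+1)-th root of the membership value, as in the note to Table I, and assumes that Very^n A is the corresponding (n+1)-th power and that NOT is the complement 1 - membership; both assumptions agree with the VERY LOW and NOT LOW rows of the table. The function names are hypothetical.

```python
def very(a, n=1):
    """Very^n A: raise each membership value to the power n + 1 (assumed form)."""
    return [x ** (n + 1) for x in a]


def morl(a, n=1):
    """MORL^n A ("more or less"): take the (n + 1)-th root of each value."""
    return [x ** (1.0 / (n + 1)) for x in a]


def fuzzy_not(a):
    """NOT A: complement of each membership value (assumed form)."""
    return [1.0 - x for x in a]


LOW = [1.00, 0.67, 0.33, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00]

print([round(x, 2) for x in very(LOW)][:4])       # compare with the VERY LOW row
print([round(x, 2) for x in morl(LOW)][:4])       # compare with the MORL LOW row
print([round(x, 2) for x in fuzzy_not(LOW)][:4])  # compare with the NOT LOW row
```

Calling `very(a, n=2)` or `morl(a, n=2)` gives the twice-hedged terms that appear as test inputs in Tables II and III.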

Table II. Performance of Fuzzy Logic Rule network with 8 hidden neurons for rule
IF X is LOW THEN Y is HIGH.

A. Training Data*

Input          Output
LOW            HIGH
VERY LOW       VERY HIGH
MORL LOW       MORL HIGH
NOT LOW        UNKNOWN

* Training terminated when the total sum of squared error dropped below ε = .001

B. Testing Results

Input            Expected Output   Actual Output                                            Total Sum Squared Error
VERY^2 LOW       VERY^2 HIGH       .00  .00  .00  .00  .00  .00  .03  .10  .27  .56  1.0    .007
MORL^2 LOW       MORL^2 HIGH       .00  .01  .01  .01  .00  .01  .56  .71  .82  .91  1.0    .030
MEDIUM           UNKNOWN           .99  .99  .99  .99  .99  .99  .99  .99  .99  .99  1.0    .001
VERY MEDIUM      UNKNOWN           .98  .98   ?   .98  .98  .98  .99  .99  .99  .99  1.0    .003
MORL MEDIUM      UNKNOWN           .99  .99   ?   .99   ?   .99  .99  .99  .99  .99  .99    .001
HIGH             UNKNOWN           .99  .99  .99  .99   ?   .99  .99  .99  .99  .99  .99    .001
NOISY LOW (1)    HIGH              .00   ?   .00  .00  .00  .00  .26  .47  .66  .83  1.0    .013
NOISY LOW (2)    HIGH              .00   ?   .00  .00  .00  .00  .19  .39  .59  .80  1.0    .0001
SHIFTED LOW      ?                 .09  .09  .12  .09  .09  .09  .91  .92  .94  .97  1.0    ?

? = value not legible in the source.

Table III. Performance of a two antecedent clause Fuzzy Logic Rule
network with 16 hidden neurons (two groups of eight).

A. Training Data*

Input                          Output
(LOW, MEDIUM)                  HIGH
(VERY LOW, VERY MEDIUM)        VERY HIGH
(MORL LOW, MORL MEDIUM)        MORL HIGH
(NOT LOW, MEDIUM)              UNKNOWN
(LOW, NOT MEDIUM)              UNKNOWN

* Training converged in 1823 iterations.

B. Testing Results

Input                           Actual Output                                            Closest Linguistic Term
(NOISY LOW (1), MEDIUM)         .00  .00  .00  .00  .00  .00  .20  .40  .60  .80  1.0    HIGH
(NOISY LOW (2), MEDIUM)         .00  .00  .00  .00  .00  .00  .19  .40  .60  .80  1.0    HIGH
(VERY^2 LOW, MEDIUM)            .00  .00  .00  .00  .00  .00  .19  .38  .60  .80  1.0    HIGH
(NOISY LOW (1), NOISY MEDIUM)   .00  .00  .00   ?    ?   .00  .20  .41   ?    ?   1.0    HIGH
(LOW, VERY^2 MEDIUM)            .00  .00  .00  .00  .00  .00   ?    ?    ?   .64  1.0    VERY HIGH
(VERY^2 LOW, VERY^2 MEDIUM)     .01  .01  .01  .01  .01  .01  .03   ?   .29  .58  1.0    VERY^2 HIGH
(MORL^2 LOW, MORL^2 MEDIUM)     .01  .01  .01  .01  .01   ?    ?   .70   ?   .91  1.0    MORL^2 HIGH
(NOT LOW, NOT MEDIUM)           1.0  1.0  1.0   ?    ?    ?   1.0  1.0  1.0  1.0  1.0    UNKNOWN
(LOW, SHIFTED MEDIUM)           .97  .97  .97  .97  .97  .97  .99  .99  .99  1.0  1.0    UNKNOWN
(MEDIUM, LOW)                   1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0    UNKNOWN

? = value not legible in the source.

Table IV. Training Data for the two disjunctive clause rule:
IF X is LOW OR Y is MEDIUM THEN Z is HIGH.

Input                          Output
(Very, MorL) LOW; *            (Very, MorL) HIGH
*; (Very, MorL) MEDIUM         (Very, MorL) HIGH
Not LOW; Not MEDIUM            UNKNOWN
MEDIUM; LOW                    UNKNOWN
HIGH; LOW                      UNKNOWN
HIGH; Very LOW                 UNKNOWN
UNKNOWN; HIGH                  UNKNOWN

* = LOW, MEDIUM, HIGH

Training converged in 4073 iterations, with TSS
error for entire training set less than 0.001
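
The shorthand in Table IV can be expanded mechanically. The sketch below shows one possible reading, which is an assumption rather than the authors' stated procedure: "(Very, MorL) LOW" is taken to stand for LOW, Very LOW and MorL LOW, the same hedge is carried over to the consequent, and "*" ranges over LOW, MEDIUM and HIGH.

```python
HEDGES = ["", "Very ", "MorL "]      # unhedged, Very, MorL
WILDCARD = ["LOW", "MEDIUM", "HIGH"]  # the terms "*" stands for


def training_triples():
    """Expand the Table IV shorthand into (X, Y, Z) training triples."""
    triples = []
    # First row: X is a (hedged) LOW, Y is any term.
    for hedge in HEDGES:
        for y in WILDCARD:
            triples.append((hedge + "LOW", y, hedge + "HIGH"))
    # Second row: X is any term, Y is a (hedged) MEDIUM.
    for hedge in HEDGES:
        for x in WILDCARD:
            triples.append((x, hedge + "MEDIUM", hedge + "HIGH"))
    # Remaining rows are listed explicitly in the table.
    triples += [
        ("Not LOW", "Not MEDIUM", "UNKNOWN"),
        ("MEDIUM", "LOW", "UNKNOWN"),
        ("HIGH", "LOW", "UNKNOWN"),
        ("HIGH", "Very LOW", "UNKNOWN"),
        ("UNKNOWN", "HIGH", "UNKNOWN"),
    ]
    return triples


triples = training_triples()
print(len(triples), len(set(triples)))
```

Under this reading the expansion has 23 rows, which matches the count of input/output triples given in the text, although the triple (LOW, MEDIUM, HIGH) is then generated by both of the first two rows.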

Abstract

Even though the technology of neural nets has been successfully applied to image
analysis, signal processing, and pattern recognition, most real world problems are
too complex to be solved purely by neural networks. Two important issues regarding
the application of neural networks to complex problems are (1) the integration of
neural computing and symbolic reasoning, and (2) the monitoring and control of
neural networks. Most hybrid models attempt to integrate neural net and symbolic
processing technologies at the level of basic data representation and data
manipulation mechanisms. However, intrinsic differences in the low-level data
processing of the two technologies limit the effectiveness of that approach. This
paper discusses the role of fuzzy logic in a hybrid architecture that combines the
two technologies at a higher, functional level. Fuzzy inference rules are used to
make plausible inferences by combining symbolic information with soft data generated
by neural nets. Neural networks are viewed as modules that perform flexible
classification from low-level sensor data. The symbolic system provides a global
shared knowledge base for communications and a set of control tasks for an
object-oriented interface between neural network modules and the symbolic system.
Fuzzy action rules are used to detect situations under which certain control tasks
need to be invoked for neural network modules. The hybrid architecture, which
supports communication and control across multiple cooperative neural nets through
the use of fuzzy rules, enables the construction of modular, flexible, and
extensible intelligent systems, reduces the effort for developing and maintaining
such systems, and facilitates their application to complex real world problems that
need to perform low-level data classification as well as high-level problem solving
in the presence of uncertainty and incomplete information.

1 Introduction


Recent development of neural network technology has demonstrated many promising 
applications in the areas of pattern recognition, image processing, and speech recognition. 
However, most real world problems are too complex to be solved purely by current neural 
network technologies. This paper addresses two important issues regarding building 
complex intelligent computer systems based on neural networks. 

1. How to integrate neural computing with symbolic reasoning?

A complex application usually can benefit from a synergistic integration of neural 
computing and symbolic reasoning. For example, in anti-submarine warfare, one 
might like to combine signal processing results computed in a neural net with sym- 
bolic analyses of evidence such as database information (e.g., records of confirmed 
vessel departures from port) and extended inference procedures (e.g., hypotheses 
about plausible mission plans). Many other problems, ranging from speech and 
vision to space applications, share this property of needing synergy between neural 
nets and symbolic approaches. 

2. How to monitor and control the behavior of neural networks? 

If one wishes to construct a real world application such as anti-submarine warfare 
using neural networks, it is crucial to have mechanisms for interpreting and react- 
ing to the results produced by the neural nets, so that the overall system can cope 
with the rapidly changing and unanticipated situations. For example, after being 
activated by an input pattern, a bidirectional associative memory, or BAM[11], 
might converge to a pattern not belonging to the set of training patterns. This 
misclassification phenomenon can be caused by having overly similar or numer- 
ous training patterns. In either case, the BAM needs to be modified (i.e., certain 
training patterns need to be removed from the training set) to improve its perfor- 
mance. Therefore, the system needs a controller that oversees the behavior of the 
neural networks. A general mechanism that supports the control across multiple 
cooperative neural nets will enable the construction of modular, flexible, and ex- 
tensible neural net systems, reduce the effort for developing and maintaining such 
systems, and facilitate their application to complex real world problems. The need
for a higher-level system for evaluating the performance of neural networks has also
been suggested by other researchers [14].

This paper discusses the role of fuzzy logic in integrating neural networks and
symbolic systems and in supervising the behavior of neural networks. To do this, we
propose a hybrid architecture that uses fuzzy logic to combine the two technologies at
a higher, functional level. Two types of fuzzy rules are supported by the architecture:
fuzzy inference rules and fuzzy action rules. Fuzzy inference rules are used to
assimilate the outputs of neural nets, which are often soft data [24], into the symbolic
system. Fuzzy action rules are used to issue control tasks, which are implemented by
methods in object-oriented programming, for activating, training, and modifying neural
nets. Neural networks are viewed as modules that perform flexible classification. The
symbolic system provides a global shared knowledge base for communications and a fuzzy
rule interpreter for performing rule-based reasoning.

Most hybrid models attempt to integrate neural net and symbolic processing tech- 
nologies at the level of basic data representation and data manipulation mechanisms. 
However, intrinsic differences in the low-level data processing of the two technologies 
limit the effectiveness of that approach. In contrast, our approach combines the two 
technologies at a higher, functional level. The symbolic system views neural networks 
as modules that (1) extend its reasoning capabilities into flexible classification and data 
associations, and (2) extend its learning capabilities into adaptive learning. Neural nets 
each view the symbolic system as providing a global shared memory for communications 
and a controller, built using fuzzy action rules, for activating, training, and monitoring 
them. Fuzzy inference rules are used to pass data between the two subsystems; and fuzzy 
action rules are used to pass action between the two. 

The key features of the proposed architecture that will provide these desirable prop- 
erties include the following: 

1. Fuzzy rules can invoke neural nets for testing “soft” (fuzzy) conditions in their 
left-hand-sides. 

2. Recognition of situations requiring actions on neural networks is accomplished via 
fuzzy action rules, whose actions are modified by the degree that the rules’ condi- 
tions are matched. 

3. Both high-level descriptions (e.g., input-output characterizations) and the behavior 
(e.g., performance evaluations) of neural networks will be modeled using a princi- 
pled frame-based language. 

4. The symbolic system will interact with neural nets through a set of generic func- 
tions called control tasks. Control tasks will be implemented using methods in 
object-oriented programming so that common methods can be shared, and specific 
methods can override general ones. 

In the following sections, we first discuss the background of this work, then we describe 
the hybrid architecture with an emphasis on the features mentioned above. Finally, we 
summarize the benefits of our approach. 

2 Background


2.1 Two Complementary Technologies: Neural Networks and 
Artificial Intelligence 

Neural networks and symbolic reasoning are two complementary approaches for achieving 
the same goal: building autonomous intelligent systems. The major strengths of neural
networks are their capabilities for performing flexible classification and adaptive learning.
By automatically capturing similarities among training instances (i.e., adaptive learning), 
neural networks are often able to perform flexible classification. That is, when given input 
data which is similar, but not identical, to inputs upon which the system has been trained, 
the network generates output similar to the trained responses. Consequently, a trained 
neural network is able to classify data approximately even when that data is incomplete 
or noisy. Thus, while most AI systems cannot tolerate such data, neural networks promise 
a system whose performance gracefully degrades under those circumstances. 

On the other hand, neural networks have several major weaknesses. They have trouble 
handling multiple instances of the same concept. Viewed as a pattern-matcher, they have 
trouble dealing with patterns containing variables. They tend to be specialized for a 
specific task. Solving complex tasks is likely to require cooperation between many neural 
networks, but managing their intercommunication is not well-understood. Control of 
the activation and learning behavior of these networks by higher-level modules is also 
not well-understood. Because their internal representation is in a form that cannot be 
comprehended by the user easily, it is hard to explain the rationale behind the output 
of neural networks. Although some of these problems have been addressed by neural 
network researchers (e.g., schema theory[2] addresses the first two issues), a neural net 
approach that addresses all these problems is yet to be developed. The goal of this 
research is to develop a comprehensive solution to these concerns using fuzzy logic and 
existing AI techniques. 

Certain AI techniques suggest solutions to the problems illustrated above. Different 
instances of a concept are easily represented using frame-based knowledge representa- 
tion systems. Variables often occur in patterns, which can be matched with data using 
a pattern matching facility. The notion of supporting many independent modules that 
communicate through a global knowledge base accessible to all modules is an idea central 
to many AI systems. For example, blackboard architectures maintain a data structure 
(the “blackboard”) where all knowledge sources can post or retrieve information. Produc- 
tion system architectures also have a working memory that all productions match their 
conditions against and act upon. An AI system may also provide a higher-level
controller, often called the meta-level architecture, that has knowledge about the
lower-level system and is able to control the lower-level system in various ways. The
explanation capabilities of AI systems have been enhanced by explicitly representing
problem solving strategies [15].

Our integration of AI capabilities with neural nets is designed to address these issues. 
In Section 2.2, we explain the concerns driving the design. In Section 3, we detail our 
approach. 


2.2 Problems with Current Hybrid Approaches 

Combining neural networks and AI is certainly not a new idea, but previous efforts have 
not addressed the important issues raised above. A number of researchers have used 
neural networks to reimplement AI techniques such as production systems and semantic 
networks [19, 7]. Work in this area mainly demonstrates what neural networks can 
do, not that their implementations are better than the conventional ones. Others have 
applied neural networks to expert systems, natural language understanding, and other 
areas that have mainly utilized conventional AI techniques [9]. Work in these first two 
categories applies current neural net technologies, rather than addressing weaknesses of 
neural nets. Furthermore, it has demonstrated neural net implementations of things that 
AI can easily handle, rather than things that AI has great difficulties in doing (e.g., 
partial matching). A few researchers have introduced ideas from neural networks into 
conventional AI techniques or architectures. For example, Anderson’s ACT* architecture
incorporates the notion of “activation values” into the memory structure and the rule 
base of a production system architecture [1]. Although such hybrid models do attempt to 
augment the weaknesses of AI, they do not attempt to address issues regarding multiple 
neural nets because there are no neural net modules in these connectionist models at 
all. Finally, some efforts have introduced ideas from AI into neural nets. Network 
regions, for instance, impose hierarchical structures from frame-based systems onto neural 
networks[6]. Although concerned with the weakness of neural nets, these efforts have not 
been able to overcome the two technologies’ intrinsic differences in data representation 
and data manipulation mechanisms. 

In neural networks, data are represented in a distributed fashion within dynamic 
networks and data manipulation involves numeric computations. In artificial intelligence, 
each conceptual entity is represented as a unit composed of symbols and pointers to 
other units, and data manipulation involves logical deduction and pattern matching. 
Our approach to this mismatch of representations is to integrate AI, not with these 
basic mechanisms of neural networks, but rather with their high-level functions: i.e., 
classification and data association. These refer to the capability of a neural net to take 
an input pattern and either classify it with respect to some set of classes, or generate an 
output pattern most closely associated with the input pattern. Viewed at this functional
level, these capabilities are closely related to pattern matching and automated reasoning 
functions in symbolic systems. 

Based on these observations, we will describe a novel hybrid architecture that allevi- 
ates the difficulties encountered by current hybrid models through the use of fuzzy logic 
in integrating the two paradigms at their functional levels. The architecture provides an 
extremely high degree of synergy between the approaches, along precisely the dimensions 
required to facilitate ease of programming and enable scaling-up to larger problems. 

2.3 Fuzzy Logic and Neural Networks 

Several techniques for integrating fuzzy logic and neural networks have been suggested. 
For instance, neural nets have been suggested for learning the membership functions of 
a fuzzy set [16]. The learning techniques in neural nets have been applied to learning 
fuzzy control rules [12]. Finally, fuzzy cognitive maps suggest an approach for capturing
fuzzy knowledge within the framework of associative memories [10]. Our discussion here
will be focused on the roles of fuzzy logic in integrating multiple neural networks and
knowledge-based systems and in monitoring the performance of neural networks.


3 A Hybrid Architecture 

A high-level block diagram of the proposed hybrid architecture is shown in Figure 1. The
architecture has four major components: (1) a set of neural net modules, (2) a symbolic
system consisting of a global knowledge base, (3) a fuzzy rule system that supports fuzzy
inference rules and fuzzy action rules, and (4) an object-oriented interface between the
symbolic system and the neural nets. The neural nets process data obtained either from
external sensor devices or from the knowledge base of the symbolic system. The global
knowledge base consists of a fuzzy database and a neural-network taxonomy that describes
meta-level knowledge about the neural nets themselves. The fuzzy database stores data
and hypotheses that can be uncertain, imprecise, or vague. The neural-net taxonomy
consists of neural-net classes (shown as circles in Figure 1) and individual neural-net
objects that form the leaves of the taxonomy (shown as rectangles). For instance, the
neural-net object BAM_1 belongs to the neural net class BAM (Bidirectional Associative
Memory), and inherits all the general properties (e.g., its training procedure and its
activation process) of the BAM class. There is one neural-net object for each neural
net module. The fuzzy rule base consists of two types of rules: fuzzy inference rules and
fuzzy action rules. Fuzzy inference rules make plausible inferences by combining symbolic
information with the outputs of neural networks. Control tasks can be invoked either by
procedure calls or by fuzzy action rules to effect activation, learning, and modification of
neural networks. These control tasks are performed by selecting and executing methods
that are inherited through the neural network taxonomy.
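
The idea of control tasks inherited through a taxonomy can be sketched with ordinary class inheritance. The example below is only an analogy in Python, not CLASP or LOOM code: the class names, the method names `train` and `activate`, and the returned strings are all hypothetical.

```python
class NeuralNet:
    """Root class of a hypothetical neural-net taxonomy; holds the general methods."""

    def __init__(self, name):
        self.name = name

    def train(self, patterns):
        return f"{self.name}: generic training on {len(patterns)} patterns"

    def activate(self, pattern):
        return f"{self.name}: generic activation"


class BAM(NeuralNet):
    """Neural-net class for bidirectional associative memories."""

    def train(self, pattern_pairs):
        # A specific method overrides the general one.
        return f"{self.name}: BAM training on {len(pattern_pairs)} pattern pairs"


# An individual neural-net object (a leaf of the taxonomy).
bam_1 = BAM("BAM_1")

print(bam_1.train([("a", "b"), ("c", "d")]))  # uses the BAM-specific method
print(bam_1.activate("a"))                    # falls back to the general method
```

A control task such as "train" is then a single generic operation for the symbolic system, and the taxonomy decides which concrete method runs for a given neural-net object.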

The hybrid architecture is an extension of CLASP [23], an advanced AI programming 
environment that fuses the best aspects of frames, rules, and object-oriented program- 
ming. In the following sections, we discuss four major technical issues of the proposed 
hybrid architecture: 

1. Using fuzzy inference rules to combine the output of multiple neural networks with 
symbolic information; 

2. Modeling meta-level knowledge about neural networks in a symbolic knowledge 
base; 

3. Using a set of control tasks, which are implemented by methods in object-oriented
programming, to define the interface between symbolic systems and neural nets; 

4. Using fuzzy action rules to recognize situations necessitating actions upon neural 
networks. 

Throughout the following discussion, we will use a sensor fusion system for anti-submarine 
warfare as an example to illustrate our approach. This hypothetical system consists 
of multiple neural nets for classifying various kinds of sensor input and for integrating 
various information about submarines, along with a symbolic expert system for analyzing 
the findings and planning anti-submarine strategies. 


3.1 Fuzzy Inference Rules 

We use fuzzy inference rules to assimilate the outputs of neural networks into the symbolic 
system, because neural networks often generate classification results that are imprecise 
in nature. For instance, a neural network that determines the hostility classification of a 
submarine could generate a qualitative measure of hostility (e.g., hostility degree is 0.7), 
or the membership values of several fuzzy sets (e.g., membership value of very-hostile is
0.6, membership value of hostile is 0.8, ... ). 

A fuzzy inference rule checks certain soft conditions, then makes a plausible conclusion
based on the degree to which those conditions are satisfied. The condition side of a fuzzy
rule consists of fuzzy conditions as well as non-fuzzy conditions. A fuzzy condition can
be checked by invoking a neural net module in a data-driven fashion (i.e., the neural net
is activated by the arrival of data). From the symbolic system’s point of view, neural
net modules act as predicates in a fuzzy rule’s condition side that check a “soft” (fuzzy)
condition and return a number between zero and one indicating the degree of matching
(e.g., the membership value of a fuzzy set). Figure 2 shows an example of a fuzzy
inference rule¹, where source refers to some noise-producing objects, such as propellers
and shafts on ships. Fuzzy sets in the rules are expressed in uppercase. Suppose a neural
net NN_1 classifies sensor data from hydrophones into possible sources of the noise. The
fuzzy inference rule will combine the output of the neural net with other symbolic
information (e.g., the reason a source was lost, the location of the sources) to determine
the applicability of the rule.

    If   Source was lost due to fade-out in the NEAR-PAST, and
         Similar source started up in another frequency, and
         Locations of the sources are relatively CLOSE
    Then The possibility that they are the same Source is MEDIUM.

Figure 2: An Example of Fuzzy Inference Rules and Data-driven Neural Nets

    If   Report exists for a vessel class Rose to be in the vicinity, and
         Source likely to be associated with Rose has been detected,
    Then Expect to find other Source types associated with Rose class.

Figure 3: An Example of Fuzzy Inference Rules
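
How such a rule might be evaluated can be sketched as follows. This is an illustration under several assumptions that are not taken from the paper: the membership functions for NEAR-PAST and CLOSE, the stand-in for the neural-net similarity predicate, and the use of min to combine the degrees of the conditions are all hypothetical choices.

```python
def near_past(minutes_ago):
    """Hypothetical membership function for NEAR-PAST (1 up to 10 min, 0 from 30 min)."""
    return max(0.0, min(1.0, (30.0 - minutes_ago) / 20.0))


def close(distance_km):
    """Hypothetical membership function for CLOSE (1 up to 2 km, 0 from 10 km)."""
    return max(0.0, min(1.0, (10.0 - distance_km) / 8.0))


def nn_source_similarity(signature_a, signature_b):
    """Stand-in for a neural-net module that returns a degree of match in [0, 1]."""
    diff = sum(abs(a - b) for a, b in zip(signature_a, signature_b)) / len(signature_a)
    return max(0.0, 1.0 - diff)


def same_source_rule(lost_reason, minutes_since_loss, sig_old, sig_new, distance_km):
    """Evaluate the Figure 2 rule; conditions are combined with min (an assumption)."""
    crisp = 1.0 if lost_reason == "fade-out" else 0.0  # non-fuzzy condition
    degree = min(
        crisp,
        near_past(minutes_since_loss),
        nn_source_similarity(sig_old, sig_new),  # "soft" condition checked by a neural net
        close(distance_km),
    )
    return {"conclusion": "possibility of same Source is MEDIUM", "degree": degree}


result = same_source_rule("fade-out", 20, [0.2, 0.8, 0.5], [0.3, 0.7, 0.5], 6)
print(result["degree"])
```

The returned degree expresses how well the rule's conditions are matched; a symbolic system could use it to qualify the conclusion that it adds to the fuzzy database.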

In addition to using the output of a neural net in a data-driven fashion, a fuzzy inference
rule can also invoke a neural net in a goal-driven fashion. For instance, the fuzzy inference
rule in Figure 3 creates an expectation about the existence of certain source types. This
expectation can be verified by several neural net modules that classify noise sources
associated with Rose class vessels.

¹ The examples in Figures 2 and 3 are two rules in HASP, a blackboard system that analyzes sensor
data from hydrophone arrays for ocean surveillance missions [8].

3.2 Modeling Meta-level Knowledge about Neural Networks


For a symbolic system to control neural nets and to use them as modules that extend its 
reasoning capabilities, it needs some information about the performance and the func- 
tional behaviors (e.g., input/output descriptions) of the neural nets. Such information 
is particularly crucial for integrating neural nets and symbolic systems, as they cannot 
easily communicate with each other otherwise. Our approach is to symbolically repre- 
sent information about classes of neural networks and individual neural networks, using 
a principled frame-based knowledge representation mechanism, called term subsumption 
languages[\l]. Doing so offers three important advantages. 

1. The model describes the functional behavior of neural networks in a way that helps 
the symbolic system invoke neural nets to extend its capabilities. For instance, an 
input/output description of a neural net allows the symbolic expert system to tell 
when a question it is working on can be answered by activating a particular neural 
net. 

2. It provides the basic structure for our method inheritance mechanism (see Section 
3.3). This allows general methods and specific methods to be described at their 
appropriate abstraction level, which facilitates the sharing of common methods and 
a saving of effort in developing and modifying them. 

3. Finally, this approach enables the symbolic system to reason about the behavior 
of neural networks using the automatic classification reasoning capabilities of term 
subsumption systems [18], which extend the system’s knowledge about neural nets 
beyond what is stated explicitly in the model. 

Figure 4 shows an example of meta-level knowledge that might be kept about a neu- 
ral net for classifying the hostility of a submarine based on its location, speed, direction 
of movement, and depth. Several attributes need explanation. Reliability is the cu- 
mulative performance measure of the neural net, while performance-measure records 
the performance of the neural net’s last activation. The reliability-threshold is the 
minimum reliability of the neural network that the system can tolerate. A neural net 
needs to be modified when its reliability is below its threshold value. 
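
As a rough illustration of how such a frame might be held in a program, the sketch below mirrors the slots of Figure 4 in a Python dataclass. The class name, the instance name, and the `can_answer` helper are hypothetical; the paper models this knowledge in LOOM, not Python.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class NeuralNetFrame:
    """Meta-level description of one neural net (slots follow Figure 4)."""
    name: str
    net_type: str
    learning: str
    inputs: Tuple[str, ...]
    output: str
    training_status: str
    performance_measure: str      # result of the most recent activation
    reliability: float            # cumulative performance measure
    reliability_threshold: float  # minimum reliability the system tolerates

    def needs_modification(self) -> bool:
        """A net needs to be modified when its reliability is below threshold."""
        return self.reliability < self.reliability_threshold

    def can_answer(self, question: str) -> bool:
        """Hypothetical input/output check: does this net produce the answer?"""
        return question == self.output


net = NeuralNetFrame(
    name="example-net",  # placeholder name, not taken from the paper
    net_type="Three-layer-feedforward",
    learning="Back-propagation",
    inputs=("Location", "speed", "direction", "depth"),
    output="Hostility",
    training_status="Trained",
    performance_measure="Satisfactory",
    reliability=0.9,
    reliability_threshold=0.7,
)
print(net.needs_modification(), net.can_answer("Hostility"))
```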

CLASP provides a rich term subsumption language, LOOM [13], for modeling meta- 
level knowledge about neural nets. Term subsumption languages are knowledge repre- 
sentation formalisms that employ a formal language, with a formal semantics, for the 
definition of terms (more commonly referred to as concepts or classes), and that deduce 
whether one term subsumes (is more general than) another [17]. These formalisms gen- 
erally descend from the ideas presented in KL-ONE [5]. Term subsumption languages 

Name: BACK1 
Type: Three-layer-feedforward 
Learning: Back-propagation 
Input: Location, speed, direction, depth 
Output: Hostility 
Training-status: Trained 
Performance-measure: Satisfactory 
Reliability: 0.9 
Reliability-threshold: 0.7 


Figure 4: Meta-level Knowledge about a Neural Net 

are a generalization of both semantic networks and frames because the languages have 
well-defined semantics, which is often missing from frames and semantic networks [20, 4]. 
The major benefit of using a term subsumption language (e.g., LOOM) to model the 
neural nets lies in its strong support for developing a consistent and coherent class tax- 
onomy. This can be illustrated by the following example. Suppose the model defines 
that (1) a possible-spurious-recognition-net is any noise-sensitive-net that has 
two exemplars that differ in fewer than two pixels; and (2) CG1 is a neural net module of 
type Carpenter-Grossberg-net, which is a kind of noise-sensitive-net. If CG1 has 
two exemplars that differ in only one pixel, LOOM will infer that CG1 is a possible- 
spurious-recognition-net. Thus, using a term subsumption language to model the 
neural net taxonomy improves the consistency of the taxonomy, avoids redundancy in 
the model, and minimizes human errors introduced into the meta-level knowledge base. 
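
The inference in the Carpenter-Grossberg example can be imitated in plain Python. The sketch below is only an analogy for what a term subsumption classifier derives automatically: the taxonomy dictionary, the function names, and the exemplar data are invented for illustration, and no LOOM syntax or API is implied.

```python
from typing import Optional, Sequence, Tuple

# Toy taxonomy: child concept -> parent concept.
PARENT = {
    "Carpenter-Grossberg-net": "noise-sensitive-net",
    "noise-sensitive-net": "neural-net",
}


def is_a(concept: Optional[str], ancestor: str) -> bool:
    """True if `concept` equals `ancestor` or lies below it in the taxonomy."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = PARENT.get(concept)
    return False


def min_exemplar_difference(exemplars: Sequence[Tuple[int, ...]]) -> int:
    """Smallest number of differing pixels between any two exemplars."""
    return min(
        sum(a != b for a, b in zip(x, y))
        for i, x in enumerate(exemplars)
        for y in exemplars[i + 1:]
    )


def is_possible_spurious_recognition_net(net_type, exemplars) -> bool:
    """Definition from the text: a noise-sensitive net that has two
    exemplars differing in fewer than two pixels."""
    return (
        len(exemplars) >= 2
        and is_a(net_type, "noise-sensitive-net")
        and min_exemplar_difference(exemplars) < 2
    )


# Invented exemplars; the first two differ in exactly one pixel.
cg_exemplars = [(0, 1, 1, 0), (0, 1, 0, 0), (1, 0, 0, 1)]
print(is_possible_spurious_recognition_net("Carpenter-Grossberg-net", cg_exemplars))
```

In a term subsumption system the membership test is not written by hand as it is here; the classifier derives it from the concept definitions, which is the source of the consistency benefit described above.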

3.3 Control Tasks and Methods 

To link a symbolic system and neural net modules, a hybrid system needs to define a set 
of functions that interface between them. These functions facilitate the construction of a 
layered hybrid system by serving as the intermediate layer between the symbolic system 
and the neural nets. This layered approach means that hybrid systems will be built in a 
flexible and extensible way because we can extend the intermediate layer with minimum 
modification to the symbolic system and the neural nets. 

Our approach to building the intermediate level has two major aspects. First, we 
use a set of generic functions (called control tasks) to define the interaction between the 
symbolic system and the neural networks. Second, we use methods in object-oriented 
programming to implement control tasks. 

Figure 5: An example of control task decomposition 

Conceptually, we can view control tasks as messages sent back and forth between 
symbolic systems and neural networks. Symbolic systems use control tasks to activate 
and modify neural network modules; these, in turn, use control tasks to inform the sym- 
bolic system about their input/output behaviors. For example, the symbolic system 
would send an activate-net message to a neural network object in order to activate its 
corresponding neural network module 2 . Conversely, the neural network module would 
send a set-performance-measure message to the neural net object in order to up- 
date the neural net’s performance-measure (possibly causing monitoring rules to be 
triggered). Some of the basic control tasks supported by the architecture may include: 
activate-net, train-net, set-training-status, set-performance-measure, update- 
reliability, and remove-training-pattern. 
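
The message-passing view of control tasks can be sketched as follows. This is an illustrative Python analogy under stated assumptions: the `NeuralNetObject` class, its `send` method, the monitor callbacks, and the toy module are hypothetical and do not reproduce CLASP's generic-function mechanism.

```python
class NeuralNetObject:
    """Symbolic-side object standing for one neural network module."""

    def __init__(self, name, module):
        self.name = name
        self.module = module            # callable that runs the network
        self.performance_measure = None
        self.monitors = []              # callbacks standing in for monitoring rules
        # Control tasks this object understands, keyed by message name.
        self.handlers = {
            "activate-net": self.activate_net,
            "set-performance-measure": self.set_performance_measure,
        }

    def send(self, task, *args):
        """Deliver a control task (message) to its handler."""
        if task not in self.handlers:
            raise ValueError(f"unsupported control task: {task}")
        return self.handlers[task](*args)

    def activate_net(self, data):
        return self.module(self, data)

    def set_performance_measure(self, value):
        self.performance_measure = value
        for monitor in self.monitors:   # possibly triggers monitoring rules
            monitor(self)


def toy_module(net_object, data):
    """Hypothetical module: classifies the data, then reports its performance."""
    result = "hostile" if sum(data) > 1.0 else "friendly"
    net_object.send("set-performance-measure", "satisfactory")
    return result


triggered = []
net = NeuralNetObject("example-net", toy_module)
net.monitors.append(lambda obj: triggered.append(obj.performance_measure))
answer = net.send("activate-net", [0.7, 0.6])
print(answer, net.performance_measure, triggered)
```

The symbolic side only names the task it wants performed; how the module carries it out stays hidden behind the handler table.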

Our approach increases the reusability of modules and reduces the cost of developing 
and maintaining the system in two ways. First, it separates the purpose of a task from 
its implementation. Using control tasks to indicate “what needs to be done” allows the 
symbolic system and the neural nets to interact at an abstraction level that is indepen- 
dent of their detailed implementations. Second, our approach facilitates decomposing 
tasks into subtasks that can be shared by multiple neural nets. For example, the con- 
trol task activate-net can be further decomposed into five subtasks as shown in Figure 
5. By decomposing control tasks into subtasks, which are functional modules, we sep- 
arate application-specific modules (such as encode-input and decode-output 3 ) from 
application-independent modules (such as activate-input). 

2 A neural network module can also be activated by the arrival of sensor data. 

CLASP offers a mechanism for defining generic functions (also called operators) that 
can be invoked by rules or by function calls in any program [22]. CLASP’s capability to 
invoke generic functions by rules and by procedural calls is important because it allows the 
symbolic system to invoke control tasks by rule triggering, and the neural net modules 
to initiate control tasks through procedural invocations. 

Control tasks will be implemented using method inheritance mechanisms in CLASP’s 
object-oriented programming capabilities 4 . The methods implementing control tasks are 
attached to the neural net objects, which are organized into a taxonomy. An individual 
neural net inherits all its methods from its parents in the taxonomy. To implement a 
control task for a neural net N, the architecture finds a method for the task that is 
inherited from the most specific parent of N. This approach increases the reusability of 
methods, and avoids redundancy in defining similar methods. For example, although 
different bidirectional associative memories (BAMs) may differ in how they encode and 
decode symbolic information, they could all share the same activate-input method. 
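
A minimal sketch of this sharing, using ordinary Python class inheritance as a stand-in for CLASP's more general method dispatch: the class names, the weight matrix, and the encodings are invented for illustration, and only the three subtasks named in the text are modeled.

```python
class NeuralNet:
    """Root of a hypothetical neural net taxonomy."""

    def activate_net(self, raw):
        # Control task composed of subtasks. Only these three are named in
        # the text, so the remaining subtasks of Figure 5 are omitted.
        pattern = self.encode_input(raw)
        output = self.activate_input(pattern)
        return self.decode_output(output)


class BAM(NeuralNet):
    """Bidirectional associative memory; activation is application-independent."""

    def __init__(self, weights):
        self.weights = weights  # rows: input units, columns: output units

    def activate_input(self, pattern):
        # One forward pass: threshold the weighted sums to bipolar values.
        sums = [sum(p * w for p, w in zip(pattern, column))
                for column in zip(*self.weights)]
        return [1 if s >= 0 else -1 for s in sums]


class BitStringBAM(BAM):
    """Application-specific coding: bit strings in, labels out."""
    LABELS = {(1, -1): "red", (-1, 1): "blue"}

    def encode_input(self, raw):
        return [1 if ch == "1" else -1 for ch in raw]

    def decode_output(self, output):
        return self.LABELS.get(tuple(output), "unknown")


class FlagListBAM(BAM):
    """A different coding that reuses the same inherited activation."""

    def encode_input(self, raw):
        return [1 if flag else -1 for flag in raw]

    def decode_output(self, output):
        return tuple(output)


WEIGHTS = [[1, -1], [1, -1], [-1, 1]]  # stores the pair (1, 1, -1) -> (1, -1)
label = BitStringBAM(WEIGHTS).activate_net("110")
signs = FlagListBAM(WEIGHTS).activate_net([True, True, False])
print(label, signs)
```

Neither subclass defines `activate_input`; both find it on their most specific parent that provides one, which is the reuse the taxonomy is meant to deliver.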


3.4 Fuzzy Action Rules 

In addition to storing meta-level information about neural nets and specifying possible 
control actions on a neural net, the symbolic system needs a mechanism for recognizing 
situations within neural nets that indicate a need for action. Even though production 
systems in artificial intelligence offer such a capability, they do not address the issue 
of partial matching (accepting an approximate fit between observed data and a rule’s 
condition). A production system that takes into account the degree of partial matching 
will enable the system to respond in a flexible way even in the face of incomplete or noisy 
data. 

Our approach is to use fuzzy action rules, a generalization of production rules, to issue 
control tasks to neural net modules. A fuzzy action rule can use the degree to which its 
condition is satisfied to adjust its action 5 . Depending on the partial matching result, a fuzzy action 
rule may or may not be deemed applicable. For example, a rule may be viewed as applicable 

3 In our terminology, encoding refers to transforming raw sensor data or symbolic information into 
neural net representations, and decoding refers to transforming neural net representations back into 
symbolic form. 

4 Actually, the method-dispatching mechanism in CLASP is more general than those in object-oriented 
programming languages (e.g., SMALLTALK-80) in that it allows programmers to describe more complex 
situations in which a method applies [21]. 

5 The partial matching results of fuzzy productions can also be used for conflict resolution [3]. 

If neural net N is a kind of bidirectional associative memory, and 
   its classification results are UNSATISFACTORY, 
Then decrease its reliability SLIGHTLY. 

If the reliability of a neural net is VERY LOW, 
Then set a goal to diagnose and fix the neural net and initialize 
   the priority of the goal to be proportional to the degree of matching. 


Figure 6: Two Rules that Monitor the Performance of a Neural Net 

only if the degree of matching is greater than a threshold value. 

To illustrate how we use fuzzy action rules to control activation, training, and per- 
formance of neural nets, two monitoring rules (paraphrased into English) are shown in 
Figure 6. They monitor neural net modules by updating and acting on the modules’ 
performance measures. The first rule illustrates how our neural net taxonomy allows 
rules to apply over whole classes of neural net modules. The second rule demonstrates 
that the actions of rules can be high-level tasks that cause the symbolic system to pursue 
further problem solving and diagnostic reasoning. 
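
The two monitoring rules can be approximated in Python as shown below. Every numeric choice here is an assumption made for illustration: the membership functions, the squaring used to model VERY, the applicability threshold of 0.3, and the step size standing in for SLIGHTLY do not come from the paper.

```python
THRESHOLD = 0.3     # assumed minimum degree of matching for a rule to apply
SLIGHT_STEP = 0.05  # assumed size of a SLIGHT decrease at full match


def unsatisfactory(error_rate):
    """Hypothetical membership function for UNSATISFACTORY results."""
    return max(0.0, min(1.0, (error_rate - 0.1) / 0.4))


def very_low(reliability):
    """Hypothetical membership for VERY LOW; squaring models the hedge VERY."""
    low = max(0.0, min(1.0, (0.8 - reliability) / 0.6))
    return low ** 2


def monitor(net, error_rate, goals):
    """Apply both monitoring rules once to a net described by a dict."""
    # Rule 1: applies to every net classified as a BAM in the taxonomy.
    degree = unsatisfactory(error_rate) if net["is_bam"] else 0.0
    if degree > THRESHOLD:
        # The action is scaled by the degree to which the condition holds.
        net["reliability"] = max(0.0, net["reliability"] - SLIGHT_STEP * degree)

    # Rule 2: post a diagnostic goal whose priority follows the match degree.
    degree = very_low(net["reliability"])
    if degree > THRESHOLD:
        goals.append({"task": "diagnose-and-fix",
                      "net": net["name"],
                      "priority": degree})


goals = []
weak = {"name": "example-bam", "is_bam": True, "reliability": 0.25}
healthy = {"name": "other-bam", "is_bam": True, "reliability": 0.9}
monitor(weak, error_rate=0.5, goals=goals)
monitor(healthy, error_rate=0.15, goals=goals)
print(weak["reliability"], healthy["reliability"], goals)
```

With these assumed values, only the weak net has its reliability lowered and a goal posted; the healthy net matches neither condition strongly enough to cross the threshold.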


4 Summary 


We have outlined a novel hybrid architecture that uses fuzzy logic to integrate neural 
networks and knowledge-based systems. Our approach offers important synergistic ben- 
efits to neural nets, approximate reasoning, and symbolic processing. Fuzzy inference 
rules extend symbolic systems with approximate reasoning capabilities, which are used 
for integrating and interpreting the outputs of neural networks. The symbolic system 
captures meta-level information about neural networks and defines its interaction with 
neural networks through a set of control tasks. Fuzzy action rules provide a robust 
mechanism for recognizing situations in neural networks that require certain con- 
trol actions. The neural nets, on the other hand, offer flexible classification and adaptive 
learning capabilities, which are crucial for dynamic and noisy environments. By combining 
neural nets and symbolic systems at their functional level through the use of fuzzy logic, 
our approach alleviates current difficulties in reconciling differences between the low-level 
data processing mechanisms of neural nets and AI systems. 

Our technical approach to achieving this high-level integration also offers several 
advantages concerning the development and the maintenance of applications based on 
the hybrid architecture: 

1. Fuzzy logic serves as a natural bridge that brings together the subsymbolic processing 
of neural networks and the symbolic reasoning of knowledge-based systems. 

2. The interface between the symbolic system and the neural nets can be modified easily 
because it is implemented using a layered and modular approach. 

3. Meta-level knowledge about neural nets is stored in a taxonomic structure that 
facilitates the sharing of information and procedures (e.g., methods). 

4. Representing information about neural nets using a principled AI knowledge repre- 
sentation language enables the system to reason about the behavior of neural nets 
using AI deductive reasoning capabilities. 

The hybrid architecture, which supports communication and control across multi- 
ple cooperative neural nets through the use of fuzzy rules, enables the construction of 
modular, flexible, and extensible intelligent systems, reduces the effort for developing 
and maintaining such systems, and facilitates their application to complex real-world 
problems that require low-level data classification as well as high-level problem 
solving in the presence of uncertainty and incomplete information. 